Toke Eskildsen [email protected] IIPC Technical Training Workshop 2015 – Large Scale Net Archive Indexing - 1/26
Scaling Net Archive Indexing & Search
IIPC Technical Training Workshop 2014
@TokeEskildsen Low-level search guy
(boss says “System Architect”)
Scaling SolrCloud indexing
● CPU for analysis, bulk read & write for Solr
● Homogeneous shards (law of large numbers)
● Solr index update entry point might be bottleneck (so use more entry points)
● Routing overhead
● Splitting and moving shards
● Schema changes might require parallel rebuild
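The entry-point bullet can be illustrated with a minimal sketch that spreads update batches round-robin over several Solr nodes. The node URLs and the `assign_batches` helper are hypothetical, and the sketch only plans the distribution; it sends nothing to Solr.

```python
from itertools import cycle

def assign_batches(batches, entry_points):
    """Pair each update batch with an entry point, round-robin."""
    if not entry_points:
        raise ValueError("at least one entry point is required")
    # zip stops when the batches run out; cycle repeats the entry points.
    return list(zip(cycle(entry_points), batches))

# Hypothetical node URLs, for illustration only.
entry_points = ["http://solr1:8983/solr", "http://solr2:8983/solr"]
batches = [["doc1", "doc2"], ["doc3"], ["doc4", "doc5"]]
plan = assign_batches(batches, entry_points)
for url, batch in plan:
    print(url, len(batch))
```

With more entry points than one, no single node has to receive and route every update.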
Static independent shards
● Easy scaling
– Predictable resource requirements
● Selective shard rebuilding
● Trivial backup
● Lower overall requirements
– Half the JVM heap requirements
– Single segment → Higher performance
– Less disk cache competition
● Temporal locality
– Better disk cache utilization with few users
– Hot spot problem with more users
– Ranking suffers (in theory)
● No document-level updates!
~250M docs / 900GB shard
Static independent shards search
[Diagram: Searcher 1 with Shard 01, Shard 02 and Shard 03; ZooKeeper]
Building static shards
● Not standard Solr
● Sample setup (distribution optional)
– 24 CPU cores (more would be nice)
– 1 Solr indexer @ 40 GB RAM
– 1 Archon tracking (W)ARC files
– 1 Arctika controlling webarchive-discovery (Tika)
– 40 webarchive-discovery (Tika) @ 1 GB RAM
– Final shards: 250M docs, 900GB, fully optimized
Archon + Arctika: https://github.com/netarchivesuite/netsearch
Static independent shards index
[Diagram: Storage (ARC 1 … ARC n) → Archon (ARC paths) → Arctika 1 with WAD 1 … WAD n per indexer → Indexer 1 (Shard 4), Indexer 2 (Shard 5); Searcher 1 with Shard 1, Shard 2, Shard 3]
WAD = webarchive-discovery from UKWA: https://github.com/ukwa/webarchive-discovery
Measuring search performance
● Mimic real-world scenarios
– Unique queries, preferably logged from production
– Warmed caches
– Concurrent searches (if relevant)
– Measured time, not reported QTime
● Capture setup data
– Index size, shard count, document count, free cache memory, sar logs
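As a rough sketch of the measuring advice above, the harness below times each unique query with a client-side wall clock instead of trusting the QTime reported by Solr, and can issue the queries concurrently. `fake_search` is a stand-in that merely sleeps; a real test would replace it with a call to the search service, and all names here are hypothetical.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def timed_search(search, query):
    """Return wall-clock milliseconds for one search call."""
    start = time.perf_counter()
    search(query)
    return (time.perf_counter() - start) * 1000.0

def measure(search, queries, threads=1):
    """Run each unique query once; return per-query wall-clock times in ms."""
    unique = list(dict.fromkeys(queries))  # drop repeats, keep order
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda q: timed_search(search, q), unique))

def fake_search(query):
    """Stand-in for a real search client; sleeps instead of querying Solr."""
    time.sleep(0.001 * len(query))

times = measure(fake_search, ["solr", "net archive", "solr"], threads=2)
print(len(times), statistics.median(times))
```

Dropping repeated queries matters because a repeated query is mostly answered from caches and makes the numbers look better than production will.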
Predicting scaling requirements
● All else is rarely equal
– Disk cache / index size ratio
– CPU cores / shard
– Slowest shard dictates total response time
● 3 or more measurement points
● Use 2 or more shards
● Visualize measurements
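The "slowest shard" point can be made concrete with a small sketch: a distributed request is done only when every shard has answered, so the total is the maximum shard time plus any merge overhead. The timings are made up for illustration and `total_response_ms` is a hypothetical helper.

```python
def total_response_ms(shard_ms, merge_overhead_ms=0.0):
    """A distributed request finishes when its slowest shard has answered."""
    return max(shard_ms) + merge_overhead_ms

# Made-up per-shard timings in milliseconds, for illustration only.
fast_shards = [120.0, 135.0, 128.0]
one_slow_shard = [120.0, 135.0, 900.0]

print(total_response_ms(fast_shards))     # 135.0
print(total_response_ms(one_slow_shard))  # 900.0
```

This is one reason to measure with 2 or more shards: a single-shard test never shows the effect of an outlier.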
SolrCloud distributed search
● Phase 1
– Tophits calculation (fast)
– Simple faceting (medium to slow)
● Phase 2
– Document resolving (fast)
– Facet fine count (medium to very slow)
● Coordination and merge overhead
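The two phases can be sketched with in-memory dictionaries standing in for shards. This is a toy model under stated assumptions: scoring is a plain term count rather than Lucene ranking, faceting is left out, and every name is hypothetical.

```python
import heapq

def phase1(shard, query, rows):
    """Per shard: return the top (score, doc_id) pairs, without content."""
    hits = [(doc["text"].count(query), doc_id)
            for doc_id, doc in shard.items() if query in doc["text"]]
    return heapq.nlargest(rows, hits)

def distributed_search(shards, query, rows):
    # Phase 1: collect top hits from every shard and merge them.
    candidates = []
    for name, shard in shards.items():
        for score, doc_id in phase1(shard, query, rows):
            candidates.append((score, doc_id, name))
    top = heapq.nlargest(rows, candidates)
    # Phase 2: resolve full documents only for the merged winners.
    return [shards[name][doc_id] for _, doc_id, name in top]

shards = {
    "shard01": {"a": {"text": "solr solr search"}, "b": {"text": "tika"}},
    "shard02": {"c": {"text": "solr index"}, "d": {"text": "solr solr solr"}},
}
result = distributed_search(shards, "solr", rows=2)
print([doc["text"] for doc in result])
```

Only IDs and scores cross the wire in phase 1, which is why document resolving in phase 2 stays cheap: it touches at most `rows` documents in total.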
Interval popularity (aka long tail)
ms over time
hits, ms
log(hits), ms
Bucketed percentiles (candlesticks)
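One way to produce such candlesticks, sketched under assumptions: group (hits, ms) measurements by the order of magnitude of the hit count and compute minimum, quartiles and maximum per bucket. The sample data is made up, the helper name is hypothetical, and the bucketing rule is a guess at what the chart shows.

```python
import statistics
from collections import defaultdict

def bucketed_percentiles(samples):
    """Group (hits, ms) samples by order of magnitude of hits and
    return {bucket: (min, q1, median, q3, max)}."""
    buckets = defaultdict(list)
    for hits, ms in samples:
        # Bucket 0 holds 1-9 hits, bucket 1 holds 10-99, and so on;
        # queries with zero hits get their own bucket, -1.
        bucket = len(str(hits)) - 1 if hits > 0 else -1
        buckets[bucket].append(ms)
    result = {}
    for bucket, times in sorted(buckets.items()):
        if len(times) >= 2:
            q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
        else:
            q1 = median = q3 = times[0]
        result[bucket] = (min(times), q1, median, q3, max(times))
    return result

# Made-up measurements: (hit count, response time in ms).
samples = [(5, 10.0), (7, 20.0), (120, 30.0), (150, 50.0), (180, 40.0), (0, 5.0)]
percentiles = bucketed_percentiles(samples)
for bucket, stats in percentiles.items():
    print(bucket, stats)
```

Each tuple maps directly onto one candlestick: the wick spans minimum to maximum and the body spans the quartiles.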
Abstract search hardware
● IOPS
– Needed for concurrent users and/or many shards
● Latency
– 1 request = 1 thread / shard (lying a bit)
– Lower latency → more IOPS
● Tapes < Spinning drives < SSDs < RAM
– But the truth is in the mix
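The "lower latency → more IOPS" bullet follows from simple arithmetic: a single thread issuing dependent reads can complete at most one read per latency period. The latencies below are rough assumed figures, not measurements from the talk.

```python
def iops_per_thread(latency_ms):
    """One thread issuing dependent reads completes 1000 / latency_ms per second."""
    return 1000.0 / latency_ms

# Rough, assumed access latencies in milliseconds; not figures from the talk.
assumed_latency_ms = {"spinning drive": 10.0, "SSD": 0.1}
for medium, latency in assumed_latency_ms.items():
    print(medium, iops_per_thread(latency))
```

Since one request occupies roughly one thread per shard, the per-thread figure bounds how quickly a single shard can answer a single request.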
Case study: Net Archive Search at State and University Library, Denmark
Standard request
● Free-text matching in 6 fields
● Phrase matching in 1 field
● Grouping on URL (not used in the tests)
● Faceting
– URL (~6b uniques, 7b references)
– Host & domain (millions of uniques, 7b references)
– 3 small ones (year, format, public suffix)
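A request of this shape could be assembled roughly as below. The parameter names (`q`, `defType`, `qf`, `pf`, `facet`, `facet.field`) are standard Solr parameters, but every field name is a placeholder rather than the actual schema, and using the edismax parser is an assumption.

```python
from urllib.parse import urlencode

# All field names are placeholders, not the actual schema.
params = [
    ("q", "net archive"),
    ("defType", "edismax"),  # assumed query parser
    # Free-text matching in 6 fields.
    ("qf", "title content url_text host_text links description"),
    # Phrase matching in 1 field.
    ("pf", "content"),
    ("facet", "true"),
    ("facet.field", "url"),            # very high cardinality
    ("facet.field", "host"),
    ("facet.field", "domain"),
    ("facet.field", "crawl_year"),     # the 3 small facets
    ("facet.field", "content_type"),
    ("facet.field", "public_suffix"),
    # Grouping on URL is left out, as in the tests.
]
query_string = urlencode(params)
print(query_string)
```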
Solr version & schema
● Solr 4.8.1 + SOLR-5894 patch (optional)
● Piggybacking on UKWA work
● DocValues on all large facet fields (essential)
Clever Solr config tweaks
This space intentionally left blank
CPU
Disk cache
RAM   %index   mean   median
110   0.49      658      141
 98   0.44     1004      170
 54   0.24     2164      361
 27   0.12     5620      913
  7   0.03     8546     3012
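To read the table as a trend, the sketch below expresses each row relative to the first. The units are not stated on the slide; reading RAM as GB and the timings as milliseconds is an assumption, and `slowdown` is a hypothetical helper.

```python
# Rows copied from the table: (RAM, %index, mean, median).
rows = [
    (110, 0.49, 658, 141),
    (98, 0.44, 1004, 170),
    (54, 0.24, 2164, 361),
    (27, 0.12, 5620, 913),
    (7, 0.03, 8546, 3012),
]

def slowdown(rows):
    """Return (RAM, mean factor, median factor) relative to the first row."""
    _, _, base_mean, base_median = rows[0]
    return [(ram, mean / base_mean, median / base_median)
            for ram, _, mean, median in rows]

factors = slowdown(rows)
for ram, mean_factor, median_factor in factors:
    print(f"{ram:>4}  mean x{mean_factor:.1f}  median x{median_factor:.1f}")
```

Going from the first row to the last, the mean grows roughly 13-fold and the median roughly 21-fold.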
Concurrent requests
Concurrent requests (less faceting)
Faceting impact mitigation
Sparse faceting: http://tokee.github.io/lucene-solr/
Fewer, smaller facets
● Measure thrice & visualise
● Common Solr rules of thumb are not always applicable at Net Archive scale
● Static shards make scaling easier
● SSDs work very well for us (22TB costs £7500)
● Full distributed faceting is doable but heavy
Danish Net Archive: http://netarkivet.dk/in-english/
More Solr tech talk: http://sbdevel.wordpress.com