
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, Cloudera


OCTOBER 11-14, 2016 • BOSTON, MA

SolrCloud: High Availability and Fault Tolerance
Mark Miller

Software Engineer, Cloudera


Who am I?

I’m Mark Miller
I’m a Lucene junkie (2006)
I’m a Lucene committer (2008)
And a Solr committer (2009)
And a member of the ASF (2011)
And a former Lucene PMC Chair (2014-2015)
I’ve done a lot of core Solr work and co-created SolrCloud

This talk is about how SolrCloud tries to protect your data.

And about some things that should change.


SolrCloud Diagram


Failure Cases (shards of the index can be treated independently)

• A leader dies (loses its ZK connection)
• A replica dies, or an update from the leader to a replica fails
• A replica is partitioned (e.g., it can talk to ZK, but not to the shard leader)

[Diagram: Replica (R), Leader (L), ZooKeeper (ZK)]


Replica Recovery

• A replica will recover from the leader on startup.
• A replica will recover if an update from the leader to the replica fails.
• A replica may recover from the leader in the leader election sync-up dance.

[Diagram: Replica (R), Leader (L), ZooKeeper (ZK)]


Replica Recovery Dance

• Start buffering updates from the leader
• Publish RECOVERING to ZK
• Wait for the leader to see the RECOVERING state
• On the first recovery try, PeerSync
• Otherwise, full index replication
• Commit on the leader
• Replicate the index
• Replay buffered documents (sketched in code below)

[Diagram: Replica (R), Leader (L), ZooKeeper (ZK)]

RecoveryStrategy
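A rough sketch of the dance listed above, in Java. The interface and helper method names here are invented for illustration and stand in for what RecoveryStrategy actually does internally; this is a reading aid, not Solr's real API.

// Hypothetical sketch of the replica recovery dance. The interface below is
// invented for illustration; the real logic lives in RecoveryStrategy.
interface RecoveringReplica {
  void startBufferingUpdates();          // buffer updates arriving from the leader
  void publishState(String state);       // publish e.g. "recovering" / "active" to ZK
  void waitForLeaderToSeeRecovering();   // leader must observe the RECOVERING state
  boolean peerSync();                    // cheap sync of recent updates from the leader
  void requestCommitOnLeader();          // ensure the leader's index is committed
  void replicateIndexFromLeader();       // full index (segment) replication
  void replayBufferedUpdates();          // apply the updates buffered during recovery
}

final class RecoveryDanceSketch {
  static void recover(RecoveringReplica replica, boolean firstAttempt) {
    replica.startBufferingUpdates();
    replica.publishState("recovering");
    replica.waitForLeaderToSeeRecovering();

    // First try the cheap PeerSync; otherwise fall back to full replication.
    boolean synced = firstAttempt && replica.peerSync();
    if (!synced) {
      replica.requestCommitOnLeader();
      replica.replicateIndexFromLeader();
    }

    replica.replayBufferedUpdates();
    replica.publishState("active");
  }
}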


A Replica is Partitioned

• In the early days, we half-punted on this.
• Now, when a leader cannot reach a replica, it will put it into LIR (leader-initiated recovery) in ZK.
• A replica in LIR will realize that it must recover before clearing its LIR status (sketched below).
• We worked through some bugs, but this is very solid now.

[Diagram: Replica (R), Leader (L), ZooKeeper (ZK); the X marks the severed connection]
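Conceptually, LIR is just a per-replica marker in ZooKeeper: the leader writes it when it cannot reach a replica, and the replica must recover before clearing it. Below is a minimal sketch using the plain ZooKeeper client; the znode path and JSON payload are assumptions for illustration, not Solr's exact layout.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of leader-initiated recovery (LIR) as a marker znode.
// The path layout and payload below are illustrative assumptions.
final class LirSketch {
  private static String lirPath(String collection, String shard, String replica) {
    return "/collections/" + collection + "/leader_initiated_recovery/" + shard + "/" + replica;
  }

  // Leader side: flag an unreachable replica as down (parent znodes assumed to exist).
  static void markReplicaDown(ZooKeeper zk, String collection, String shard, String replica)
      throws KeeperException, InterruptedException {
    byte[] data = "{\"state\":\"down\"}".getBytes(StandardCharsets.UTF_8);
    String path = lirPath(collection, shard, replica);
    if (zk.exists(path, false) == null) {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(path, data, -1);
    }
  }

  // Replica side: if an LIR marker exists for me, I must recover before
  // clearing it and publishing myself as ACTIVE again.
  static boolean mustRecover(ZooKeeper zk, String collection, String shard, String replica)
      throws KeeperException, InterruptedException {
    return zk.exists(lirPath(collection, shard, replica), false) != null;
  }
}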


Leader Recovery

• The ‘best effort’ leader recovery dance (sketched below):
• If it’s after startup and the last published state is not ACTIVE, the replica can’t be leader.
• Otherwise, try to peer sync with the shard.
• On success, try to peer sync from the replicas to the leader.
• If any of those syncs fail, ask the replicas to recover from the leader.

[Diagram: Replica (R), Leader (L), ZooKeeper (ZK)]

SyncStrategy / ElectionContext
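A sketch of the best-effort decision flow described above, again with invented helper names standing in for what SyncStrategy and the ElectionContext actually do; it is not the real API.

// Hypothetical sketch of the election-time 'best effort' sync. Method names
// are placeholders; they do not correspond to the real SyncStrategy API.
interface LeaderCandidate {
  boolean isFirstElection();        // the very first election after startup?
  boolean lastPublishedActive();    // did this replica last publish ACTIVE?
  boolean syncFromReplicas();       // try to peer sync any missed updates from the shard
  boolean syncReplicasToMe();       // ask the other replicas to peer sync from this candidate
  void tellReplicasToRecover();     // fall back: replicas do a full recovery from the new leader
  void becomeLeader();
}

final class ElectionSyncSketch {
  static boolean tryToBecomeLeader(LeaderCandidate c) {
    // After the first election, only a replica whose last published state
    // was ACTIVE is allowed to attempt to become leader.
    if (!c.isFirstElection() && !c.lastPublishedActive()) {
      return false;
    }
    boolean caughtUp = c.syncFromReplicas();
    boolean replicasSynced = caughtUp && c.syncReplicasToMe();
    if (!caughtUp || !replicasSynced) {
      // If any of the syncs fail, the replicas are asked to recover from the
      // new leader (best effort, not a guarantee).
      c.tellReplicasToRecover();
    }
    c.becomeLeader();
    return true;
  }
}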


Leader Election Forward Progress Stall…

• Each replica decides for itself if it thinks it should be leader.
• Everyone may think they are unfit.
• Only replicas that last published ACTIVE will attempt to be leader after the first election.


Leader Election Forward Progress Stall…

• While rare, if all replicas in a shard lose their connection to ZK at the same time, no replica will become leader without intervention.
• There is a manual API to intervene, but this should be done automatically.
• In practice, this tends to happen for reasons that can be ‘tuned’ out of.
• Still needs to be improved.
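The manual API referred to here is the Collections API FORCELEADER action. A minimal SolrJ (6.x-era) sketch follows; the ZooKeeper hosts, collection, and shard names are placeholders, and it should only be run against a shard that is genuinely stuck without a leader.

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

// Sketch: manually force a leader for a shard stuck without one.
public class ForceLeaderSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
        .build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "FORCELEADER");
      params.set("collection", "mycollection");      // placeholder collection
      params.set("shard", "shard1");                 // placeholder shard
      NamedList<Object> response = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params));
      System.out.println(response);
    }
  }
}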


User chooses durability requirements

• You can specify how many replicas you want to see success from before an update is considered successful: the minRf param.
• The update won’t fail based on that criterion, though; Solr simply flags the achieved replication factor in the response.
• If your requested replication factor is not achieved, that also does not mean the update is rolled back.
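A SolrJ (6.x-era) sketch of sending an update with minRf and checking the achieved replication factor afterwards; the ZK hosts, collection name, document, and threshold of 2 are placeholders.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;

// Sketch: ask Solr to report whether at least 2 replicas saw the update.
// min_rf does NOT make the update fail or roll back if the target is missed;
// it only reports back the achieved replication factor.
public class MinRfSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
        .build()) {
      client.setDefaultCollection("mycollection");    // placeholder collection

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");

      UpdateRequest req = new UpdateRequest();
      req.setParam(UpdateRequest.MIN_REPFACT, "2");   // the min_rf request parameter
      req.add(doc);

      NamedList<Object> response = client.request(req);
      int achievedRf = client.getMinAchievedReplicationFactor("mycollection", response);
      if (achievedRf < 2) {
        // The update was still accepted; the application decides whether to
        // retry, alert, or otherwise compensate.
        System.err.println("Only " + achievedRf + " replica(s) acknowledged the update");
      }
    }
  }
}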


User chooses durability requirements

• If we improve some of this…
• We can stop trying so hard.
• And put it on the user to specify a replication factor that controls how ‘safe’ updates are.


JIRA


Handling Cluster Shutdown / Startup

• What if an old replica returns?
• How do we ensure every replica participates in the election?
• What if no replica thinks it should be leader?
• Staggered shutdowns?
• Explicit cluster commands might help.


Thank You!

Mark Miller
@heismark
Software Engineer, Cloudera