26
Introduction To SolrCloud Varun Thacker

Introduction to SolrCloud

Embed Size (px)

DESCRIPTION

A brief introduction to Solr and where it fits in your application, followed by an introduction to SolrCloud.

Citation preview

Page 1: Introduction to SolrCloud

Introduction To SolrCloud Varun Thacker

Page 2: Introduction to SolrCloud

Apache Solr has a huge install base and tremendous momentum

most widely used search solution on the planet. 8M+

total downloads

Solr is both established & growing

250,000+monthly downloads

Solr has tens of thousands of applications in production.

You use Solr everyday.

2500+open Solr jobs.

Activity Summary30 Day summary

Aug 18 - Sep 17 2014

• 128 Commits • 18 Contributors

via https://www.openhub.net/p/solr

12 Month Summary Sep 17, 2013 - Sep 17, 2014

• 1351 Commits • 29 Contributors

Page 3: Introduction to SolrCloud

Solr scalability is unmatched.

• 10TB+ Index Size • 10 Billion+ Documents • 100 Million+ Daily Requests

Page 4: Introduction to SolrCloud

Solr’s scalability is unmatched

Page 5: Introduction to SolrCloud

What is Solr?

• A system built to search text

• A specialized type of database management system

• A platform to build search applications on

• Customizable, open source software

Page 6: Introduction to SolrCloud

Where does Solr fit?

Page 7: Introduction to SolrCloud

What is SolrCloud?

Subset of optional features in Solr to enable and simplify horizontal scaling a search index using

sharding and replication.

Goals scalability, performance, high-availability,

simplicity, and elasticity

Page 8: Introduction to SolrCloud

Terminology• ZooKeeper: Distributed coordination service that provides centralised

configuration, cluster state management, and leader election

• Node: JVM process bound to a specific port on a machine

• Collection: Search index distributed across multiple nodes with same configuration

• Shard: Logical slice of a collection; each shard has a name, hash range, leader and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy

• Replica: A copy of a shard in a collection

• Overseer: A special node that executes cluster administration commands and writes updated state to ZooKeeper. Automatic failover and leader election.

Page 9: Introduction to SolrCloud
Page 10: Introduction to SolrCloud

Collection == Distributed Index• A collection is a distributed index defined by:

• named configuration stored in ZooKeeper

• number of shards: documents are distributed across N partitions of the index

• document routing strategy: how documents get assigned to shards

• replication factor: how many copies of each document in the collection

• Collections API:

• curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=punemeetup&replicationFactor=2&numShards=2&collection.configName=myonf

Page 11: Introduction to SolrCloud

DEMO

Page 12: Introduction to SolrCloud

Document Routing

• Each shard covers a hash-range

• Default: Hash ID into 32-bit integer, map to range

• leads to balanced (roughly) shards

• Custom-hashing

• Tri-level: app!user!doc

• Implicit: no hash-range set for shards

Page 13: Introduction to SolrCloud

Replication• Why replicate?

• High-availability

• Load balancing

• How does it work in SolrCloud?

• Near-real-time, not master-slave

• Leader forwards to replicas in parallel, waits for response

• Error handling during indexing is tricky

Page 14: Introduction to SolrCloud

Distributed Indexing

• Get cluster state from ZK

• Route document directly to leader (hash on doc ID)

• Persist document on durable storage (tlog)

• Forward to healthy replicas

• Acknowledge write succeed to client

Page 15: Introduction to SolrCloud

Shard Leader• Additional responsibilities during indexing only!

Not a master node

• Leader is a replica (handles queries)

• Accepts update requests for the shard

• Increments the _version_ on the new or updated doc

• Sends updates (in parallel) to all replicas

Page 16: Introduction to SolrCloud

Distributed Queries• Query client can be ZK aware or just query via a load

balancer

• Client can send query to any node in the cluster

• Controller node distributes the query to a replica for each shard to identify documents matching query

• Controller node sorts the results from step 3 and issues a second query for all fields for a page of results

Page 17: Introduction to SolrCloud

Scalability / Stability Highlights• All nodes in cluster perform indexing and execute

queries; no master node

• Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to new leader

• Distributed queries: Add replicas to scale-out qps; parallelize complex query computations; fault-tolerance

• Indexing / queries continue so long as there is 1 healthy replica per shard

Page 18: Introduction to SolrCloud

Zookeeper• Is a very good thing ... clusters are a zoo!

• Centralized configuration management

• Cluster state management

• Leader election (shard leader and overseer)

• Overseer distributed work queue

• Live Nodes

• Ephemeral znodes used to signal a server is gone

• Needs 3 nodes for quorum in production

Page 19: Introduction to SolrCloud

Zookeeper: State Management• Keep track of live nodes /live_nodes znode

• ephemeral nodes

• ZooKeeper client timeout

• Collection metadata and replica state in /clusterstate.json

• Every core has watchers for /live_nodes and /clusterstate.json

• Leader election

• ZooKeeper sequence number on ephemeral znodes

Page 20: Introduction to SolrCloud

Other Features/Highlights• Near-Real-Time Search: Documents are visible within a second or so after

being indexed

• Partial Document Update: Just update the fields you need to change on existing documents

• Optimistic Locking: Ensure updates are applied to the correct version of a document

• Transaction log: Better recoverability; peer-sync between nodes after hiccups

• HTTPS

• Use HDFS for storing indexes

• Use MapReduce for building index (SOLR-1301)

Page 21: Introduction to SolrCloud

Solr on YARN

Page 22: Introduction to SolrCloud

Solr on YARN

• Run the SolrClient application :

• Allocate container to run SolrMaster

• SolrMaster requests containers to run SolrCloud nodes

• Solr containers allocated across cluster

• SolrCloud node connects to ZooKeeper

Page 23: Introduction to SolrCloud

More Information on Solr on YARN

• https://lucidworks.com/blog/solr-yarn/

• https://github.com/LucidWorks/yarn-proto

• https://issues.apache.org/jira/browse/SOLR-6743

Page 24: Introduction to SolrCloud

Our users are also pushing the limits

https://twitter.com/bretthoerner/status/476830302430437376

Page 25: Introduction to SolrCloud

Up, up and away!

https://twitter.com/bretthoerner/status/476838275106091008

Page 26: Introduction to SolrCloud

Connect @

https://twitter.com/varunthacker

http://in.linkedin.com/in/varunthacker

[email protected]