Tomas Komenda, Lukas Putna, Miroslav Kvasnica (Seznam.cz)
Solr: How to index billion phrases from MySQL and HBase
Who we are
• PPC ads, AdWords competitor in CZ
• Web portal, search engine in the Czech Republic
• 40+ different web services (search, news, email, media, …)
• Lukas Putna, Tomas Komenda, Miroslav Kvasnica
• Senior developers, team leaders, trainers
• MySQL, HBase, Hadoop, Impala, Solr, Hive
What Sklik.cz is
• Advertising data + daily statistics
• Provides real-time searching, aggregation, filtering and analytics
Advertising hierarchy
Account
  Campaign, Campaign, …
    Group, Group, …
      Keywords, Ads, Retargeting, Placements, …
        queries, urls, …
Sklik.cz and data
• Advertising data + daily statistics
• Provides real-time searching, aggregation, filtering and analytics
Account example
• 10M keywords (phrases, each has a list of queries)
• 120 statistical values per keyword per day
With a date filter of 1 year
• 42 billion values
• 3 aggregated sum rows
• 10 GB
Full-text search has to return results for any sub-term of one or more characters (the term can be a prefix, infix, or postfix) => billions of combinations per account
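To see why prefix, infix, and postfix matching multiplies the number of searchable terms, the sketch below enumerates the distinct substrings of one phrase. The function name and the sample phrase are illustrative assumptions, not part of the Sklik code base.

```python
def distinct_substrings(phrase: str, min_len: int = 1) -> set[str]:
    """Return every distinct substring of `phrase` with at least `min_len` characters."""
    n = len(phrase)
    return {
        phrase[start:end]
        for start in range(n)
        for end in range(start + min_len, n + 1)
    }


# Hypothetical sample keyword; a phrase of n characters has at most n*(n+1)/2 substrings.
terms = distinct_substrings("bulldog")
print(len(terms))        # number of distinct searchable sub-terms
print("bull" in terms)   # prefix
print("lld" in terms)    # infix
print("dog" in terms)    # postfix
```

Because the count grows quadratically with phrase length, millions of keywords per account quickly reach billions of sub-terms.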
Sklik data (database) ecosystem
Full-text search technologies: Elasticsearch, Apache Solr, Sphinx, SRCH2
We used Sphinx
• Free open source search server written in C++, lightweight and powerful, SQL friendly
• Sphinx can be used as a stand-alone server or as a storage engine (SphinxSE, for MySQL and its forks)
• We adopted Sphinx at a time when automatic scaling was not well supported
• One Sphinx instance per database shard; the application has to decide which instance to query
• Fast searching, easy configuration
• Our data and requirements (index complexity) grew fast => eventually it was no longer possible to index the data
• We chose Solr because of our Hadoop ecosystem and existing HBase indexers
We considered: Apache Solr and Elasticsearch
• Both are open source, high-performance, full-featured text search tools (engines or even databases)
• Both have a distributed version
• Both built on Apache Lucene (and extend it)
• Both very popular all over the world
• Elasticsearch is probably better known and more popular in the Czech Republic
Apache Solr: Brief Introduction
Apache Solr – introduction
• Solr is an open source enterprise search server
• Solr was created by Yonik Seeley in 2004
• Current version is 5.5
• Uses the Lucene library and extends it
• Provides HTTP interface (XML, JSON, CSV, binary)
• Since 2012, Solr has had a distributed version, SolrCloud (Hadoop integration)
Apache Solr – key features
• Advanced full-text search capabilities
• Optimized for high volume web traffic
• Batch full and delta indexing, near real-time updating (streaming via Apache Flume/Kafka, soft commits)
• Adaptable with XML configuration
• Extensible plugin architecture
• Linearly scalable, auto index replication (Hadoop integration)
• Comprehensive web administration interface, statistics …
• A lot of specialized queries: faceted search, ordering, grouping, pseudo-join, spatial search, functions …
Apache Solr - architecture
Source: Jan Hoydahl, Migrating Fast to Solr (presentation), published on Mar 5, 2010, http://www.slideshare.net/janhoy/migrating-fast-to-solr
Apache Solr – architecture from data flow point of view
Data sources: MySQL, HBase, Flume
Indexing: Import/Update (JSON, XML, CSV, …) -> Analyzer, Tokenizer, Filter -> Index writer -> Index
Searching: Query parser and analyzer -> Index searcher -> Index
Apache Solr – Data model and hierarchy
Solr Instance
  Core/Index, Core/Index, Core/Index
    Documents
      Field, Field, Field
Indexing & querying configuration: solr.xml, solrconfig.xml, schema.xml
Apache Solr – schema.xml and fields definition
<schema name="Seznam Sklik Campaigns" version="1.5">
  ...
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="bool" class="solr.BoolField" sortMissingLast="true" />
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text" indexed="true" stored="true" required="true" />
    <field name="userId" type="int" indexed="true" stored="false" required="true" />
    <field name="nameSimple" type="simpleText" indexed="true" stored="false" required="false"/>
    <copyField source="name" dest="nameSimple" />
  </fields>
  <uniqueKey>id</uniqueKey>
  ...
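A client only needs to send the fields it owns; the sketch below builds one document for the schema excerpt above and checks the fields marked required="true". It is a minimal illustration under stated assumptions: the helper name is hypothetical, the id layout "userId!keywordId" mirrors the ids that appear elsewhere in these slides, and nothing is sent to a server.

```python
import json

# Fields marked required="true" in the schema excerpt.
REQUIRED_FIELDS = ("id", "name", "userId")


def build_keyword_doc(user_id: int, keyword_id: int, name: str) -> dict:
    """Build one document; nameSimple is omitted because copyField fills it from name."""
    doc = {
        "id": f"{user_id}!{keyword_id}",  # assumed layout: <userId>!<keywordId>
        "name": name,
        "userId": user_id,
    }
    missing = [field for field in REQUIRED_FIELDS if doc.get(field) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return doc


# A JSON array of documents is one of the formats accepted by Solr's update handler.
payload = json.dumps([build_keyword_doc(43932, 99192, "Black bulldogs")])
print(payload)
```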
Apache Solr – Core overview via Admin
Apache Solr – Indexing and updating
• Configuration in solrconfig.xml, schema.xml and data-config.xml
• Request handlers, Update Handlers, Update Processor Chain, Data Import Handler
• Index operations: add, delete, optimize, commit, rollback …
• Atomic updates – auto commit, soft and hard commit, transaction log for recovery scenarios
• Near real-time indexing, batch (full and delta) indexing
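The index operations listed above can be combined in one JSON update message. The sketch below assembles such a message as a string; the helper name and the sample values are hypothetical, and the message is only printed, not posted to a Solr update handler.

```python
import json


def build_update_message(docs, delete_ids=(), commit_within_ms=1000, commit=True) -> str:
    """Serialize add, delete, and commit operations into one JSON update message."""
    # The JSON update format permits a repeated "add" key, one per document,
    # so the object is assembled from key/value fragments instead of a dict.
    parts = []
    for doc in docs:
        add_body = {"doc": doc, "commitWithin": commit_within_ms}
        parts.append('"add": ' + json.dumps(add_body))
    if delete_ids:
        parts.append('"delete": ' + json.dumps(list(delete_ids)))
    if commit:
        parts.append('"commit": {}')
    return "{" + ", ".join(parts) + "}"


# Hypothetical sample data.
message = build_update_message(
    docs=[{"id": "43932!99192", "name": "Black bulldogs", "userId": 43932}],
    delete_ids=["43932!1"],
)
print(message)
```

Using commitWithin lets the server batch commits instead of forcing one per request, which is usually the gentler choice for near real-time indexing.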
Diagram: documents arrive by HTTP POST (XML <doc>, PDF) at the Update Handlers (XML, CSV, JSON, PDF, Word, ...) or are pulled (from MySQL, an RSS feed) by the Data Import Handler (database pull, RSS pull, simple transformation). Each handler passes documents through its own Update Processor Chain (per handler) into the Index (Lucene).
Apache Solr – data-config.xml and MySQL
...
<dataSource name="node1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" batchSize="-1"
            url="jdbc:mysql://skdb012.ng.seznam.cz/sklik_node" user="sklik_ro" password="…"/>
…
<dataSource name="node12" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" batchSize="-1"
            url="jdbc:mysql://skdb053.ng.seznam.cz/sklik_node" user="sklik_ro" password="…"/>
…
<document name="keywords">
  <entity name="keyword" dataSource="node1" query="
    SELECT CONCAT_WS('!', c.user_id, k.id) AS id,
           CAST(kl.name AS CHAR(255)) AS name,
           cu.url,
           g.id AS groupId,
           c.id AS campaignId,
           c.user_id AS userId
    FROM keyword k
    JOIN `group` g ON g.id = k.group_id
    JOIN campaign c ON c.id = g.campaign_id
    JOIN user u ON u.id = c.user_id
    JOIN sklik_common.keyword_lexicon kl ON k.keyword_lexicon_id = kl.id
    LEFT JOIN sklik_common.v_url cu ON k.url_id = cu.id
    WHERE u.serviced = 0
      AND ('${dih.request.user_ids}' = '' OR c.user_id IN (${dih.request.user_ids}))
      AND ('${dih.request.from}' = '' OR k.id >= '${dih.request.from}')
      AND ('${dih.request.to}' = '' OR k.id &lt; '${dih.request.to}')
      AND ('${dih.request.from_timestamp}' = '' OR k.index_date >= '${dih.request.from_timestamp}')
  " />
...
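The ${dih.request.*} placeholders in the query are filled from the parameters of the import request. The sketch below builds such a request URL; the base URL, the core name, and the /dataimport handler path are assumptions (the path is whatever name the handler is registered under in solrconfig.xml), and the function only builds the URL without calling it.

```python
from urllib.parse import urlencode


def build_dih_url(base_url, core, user_ids, id_from="", id_to="", from_timestamp=""):
    """Build a Data Import Handler request whose parameters feed ${dih.request.*}."""
    params = {
        "command": "full-import",
        "clean": "false",   # keep documents that are already in the index
        "commit": "true",
        # Custom parameters referenced by the SQL in data-config.xml:
        "user_ids": ",".join(str(u) for u in user_ids),
        "from": id_from,
        "to": id_to,
        "from_timestamp": from_timestamp,
    }
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


# Hypothetical host, core name, and user ids.
print(build_dih_url("http://localhost:8983/solr", "keywords", [101, 102],
                    from_timestamp="2016-03-01 00:00:00"))
```

Restricting the import by id range or by index_date timestamp lets the same full-import query act as a partial or delta-style import.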
Apache Solr – data-config.xml via Admin
Apache Solr – Import via Admin
Apache Solr – schema.xml and indexing
...
<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="simpleText" class="solr.TextField" sortMissingLast="true">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="foldToASCII.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="cz.seznam.sklik.solrconf.ModerateNGramFilterFactory" minGramSize="2" maxGramSize="512"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
...
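The asymmetry between the two analyzers (n-grams at index time, none at query time) is what makes infix search a plain term lookup. The sketch below imitates that idea in Python. It is an approximation under assumptions: ModerateNGramFilterFactory is a custom Seznam class whose exact behavior is not shown in the slides, so a plain n-gram expansion stands in for it, ASCII folding is skipped, and requiring all query terms to match is an assumed default.

```python
def ngrams(token: str, min_gram: int = 2, max_gram: int = 512) -> set[str]:
    """All substrings of `token` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return {
        token[i:i + size]
        for size in range(min_gram, upper + 1)
        for i in range(len(token) - size + 1)
    }


def index_terms(text: str) -> set[str]:
    """Index-time analysis: lowercase, split on whitespace, expand tokens into n-grams."""
    terms: set[str] = set()
    for token in text.lower().split():
        terms |= ngrams(token)
    return terms


def query_terms(text: str) -> list[str]:
    """Query-time analysis: lowercase and split on whitespace, no n-grams."""
    return text.lower().split()


def matches(indexed_text: str, query: str) -> bool:
    """True if every query term equals one of the indexed n-grams."""
    indexed = index_terms(indexed_text)
    return all(term in indexed for term in query_terms(query))


print(matches("Black bulldogs", "dog"))   # infix of "bulldogs"
print(matches("Black bulldogs", "cat"))   # not present
```

The cost of this design is index size: every token is stored once per n-gram, which is why the index grows far faster than the raw data.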
Apache Solr – Import stats via Admin
Apache Solr – querying
http://sksolr2.ng.seznam.cz:8983/solr/test_collection/select?q=name:*dog*&start=10&rows=1&wt=xml&indent=true

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">510</int>
    <lst name="params">
      <str name="q">name:*dog*</str>
      <str name="indent">true</str>
      <str name="start">10</str>
      <str name="rows">1</str>
      <str name="wt">xml</str>
      <str name="_">1458818597860</str>
    </lst>
  </lst>
  <result name="response" numFound="82" start="10" maxScore="1.0">
    <doc>
      <str name="name">Black bulldogs</str>
      <str name="id">43932!99192</str>
      <long name="_version_">1529083809145815041</long>
    </doc>
  </result>
</response>
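A client would typically build the select URL from parameters and read numFound and the documents from the response. The sketch below does both offline; the base URL is a placeholder, the helper names are hypothetical, and the embedded XML is a trimmed copy of the response above, so no Solr server is contacted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base_url, collection, q, start=0, rows=10, wt="xml"):
    """Build a select URL; urlencode escapes characters such as ':' and '*'."""
    params = {"q": q, "start": start, "rows": rows, "wt": wt}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def parse_docs(xml_text):
    """Return (numFound, list of documents) from an XML select response."""
    result = ET.fromstring(xml_text).find("result")
    docs = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="82" start="10" maxScore="1.0">
    <doc>
      <str name="name">Black bulldogs</str>
      <str name="id">43932!99192</str>
    </doc>
  </result>
</response>"""

print(build_select_url("http://localhost:8983/solr", "test_collection",
                       "name:*dog*", start=10, rows=1))
print(parse_docs(SAMPLE_RESPONSE))
```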
Apache Solr – querying
../select?q=keyword:hire~0.7&fq=avgCpc:[600+TO+*]+OR+competition:[0.6+TO+*]^2&sort=sum(count,competition)+desc&start=5&wt=json&indent=true

{
  "responseHeader": {
    "status": 0,
    "QTime": 1493,
    "params": {
      "q": "keyword:hire~0.7",
      "indent": "true",
      "start": "5",
      "fq": "avgCpc:[600 TO *] OR competition:[0.6 TO *]^2",
      "sort": "sum(count,competition) desc",
      "wt": "json"}},
  "response": {"numFound": 21, "start": 5, "docs": [
    {
      "query": "fire",
      "count": 108,
      "competition": 0.62176,
      "avgCpc": 589.0909,
      "months": ["2015-03", "2015-04", "2015-05", ……
Apache Solr – querying via Admin
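In the JSON query shown earlier, the returned document has an avgCpc below 600 yet still passes the filter, because the two range clauses are joined with OR. The sketch below replays that filter and the sum(count,competition) sort in plain Python. Only the first document comes from the slide; the other two are hypothetical, and the functions are an illustration of the query semantics, not Solr code.

```python
def passes_filter(doc, min_avg_cpc=600.0, min_competition=0.6):
    """Mirror fq=avgCpc:[600 TO *] OR competition:[0.6 TO *] (inclusive bounds)."""
    return doc["avgCpc"] >= min_avg_cpc or doc["competition"] >= min_competition


def sort_key(doc):
    """Mirror sort=sum(count,competition) desc."""
    return doc["count"] + doc["competition"]


docs = [
    {"query": "fire", "count": 108, "competition": 0.62176, "avgCpc": 589.0909},  # from the slide
    {"query": "hike", "count": 40, "competition": 0.10, "avgCpc": 750.0},         # hypothetical
    {"query": "wire", "count": 300, "competition": 0.20, "avgCpc": 120.0},        # hypothetical
]

kept = sorted((d for d in docs if passes_filter(d)), key=sort_key, reverse=True)
print([d["query"] for d in kept])
```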
Apache Solr – querying stats via Admin
SolrCloud: Brief Introduction
SolrCloud – introduction and architecture
• Distributed, auto index replication, linearly scalable
• Hadoop and HDFS integration
• “Roughly” a CP system (good availability), fault tolerant (HA + no single point of failure)
• Documents are routed by hashing the document ID to an integer (or by custom hashing); each shard covers a hash range
• All nodes in the cluster perform indexing and execute queries; no master node
• Terminology: ZooKeeper, Node, Collection, Replication Factor, Shard, Replica, Leader
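The hash-range routing described above can be sketched in a few lines. This is an illustration under assumptions: zlib.crc32 stands in for the hash function (Solr itself uses a different one), the hash space is treated as unsigned 32-bit integers split into equal ranges, and the function names are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # size of the assumed unsigned 32-bit hash space


def shard_ranges(num_shards):
    """Split the hash space into contiguous, inclusive (low, high) ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        low = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        high = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((low, high))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, not Solr's
    for shard, (low, high) in enumerate(ranges):
        if low <= h <= high:
            return shard
    raise AssertionError("hash ranges must cover the whole hash space")


ranges = shard_ranges(2)
print(ranges)
print(route("43932!99192", ranges))
```

Because the shard is derived from the ID alone, any node can compute where a document belongs, which fits a cluster without a master node.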
Diagram: a SolrCloud cluster spread over Server 1 and Server 2 behind a balancer. Nodes 1 to 4 (ports 8984 and 8985) each run the Solr web app in Jetty inside a Java VM and host a leader or a replica of shard1 or shard2 of the collection, with data on HDFS. ZooKeeper handles leader election. The collection is split into shards (sharding), and each shard is copied to replicas on other nodes (replication).
SolrCloud – cloud via Admin
SolrCloud – in Seznam.cz
• Two clusters (24 and 8 machines, 4 backup machines) with terabytes of indexes; we use Solr:
• as a full-text search tool for filters on our client's website
• as a keyword proposal tool (with stats) that supports creating and tuning customers' advertising
• as storage for queries and their stats (publicly accessible via website and API) for our search engine
• We are generally satisfied; we are still struggling with optimal data scaling and query performance, but indexing and availability are very good
That is all! Questions?
Thank you for listening!
Thursday 12:50 PM @ Ballroom F: MySQL and Impala ecosystem