
Tomas Komenda, Lukas Putna, Miroslav Kvasnica, Seznam.cz

Solr: How to index billion phrases from MySQL and HBase

Who we are

• PPC ads, AdWords competitor in CZ


• Web portal, search engine in the Czech Republic

• 40+ different web services (search, news, email, media, …)

• Lukas Putna, Tomas Komenda, Miroslav Kvasnica

• Senior developers, team leaders, trainers

• MySQL, HBase, Hadoop, Impala, Solr, Hive

What Sklik.cz is

• Advertising data + daily statistics

• Provides real-time searching, aggregation, filtering and analytics


Advertising hierarchy

Account
  Campaign, Campaign, …
    Group, Group, …
      Keywords, Ads, Retargeting, Placements, …
        queries, urls, …

Sklik.cz and data


Account example

• 10M keywords (phrases, each has a list of queries)

• 120 statistical values per keyword per day

With a date filter of 1 year

• 42 billion values

• 3 aggregated sum rows

• 10 GB

Full-text search has to return results for any sub-term of one or more characters (the sub-term can be a prefix, infix or postfix) => billions of combinations per account
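To make the "billions of combinations" claim concrete, here is a minimal sketch that enumerates every sub-term of one phrase. The phrase "bulldog" and the function names are illustrative assumptions, not Sklik data or code.

```python
def sub_terms(phrase):
    """Return every distinct non-empty substring of `phrase`.

    Each substring is a possible search term: a prefix, an infix or a postfix.
    """
    n = len(phrase)
    return {phrase[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def max_sub_terms(length):
    """Upper bound on the number of substrings of a string with `length` characters."""
    return length * (length + 1) // 2


# Hypothetical keyword, not taken from Sklik data.
terms = sub_terms("bulldog")
print(len(terms), max_sub_terms(len("bulldog")))
```

Under the assumption of 20-character keywords, each keyword has up to 210 sub-terms, so 10M keywords give up to roughly 2.1 billion sub-terms for a single account.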

Sklik data (database) ecosystem


Full-text search technologies: Elasticsearch, Apache Solr, Sphinx, SRCH2

We used Sphinx

• Free open source search server written in C++, lightweight and powerful, SQL friendly

• Sphinx can be used as a stand-alone server or as a storage engine (SphinxSE, for MySQL and its forks)

• We used Sphinx at a time when automatic scaling wasn’t well supported

• One Sphinx instance per database shard; the application has to decide which instance to use

• Fast searching, easy configuration

• Our data and requirements (index complexity) grew fast => eventually it was no longer possible to index the data

• We chose Solr because of our Hadoop ecosystem and existing HBase indexers
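The "application has to decide which instance to use" point can be sketched as a small routing function. Everything concrete here is an assumption for illustration: the host names, the shard count, and the modulo rule are hypothetical and are not stated in the slides.

```python
# Hypothetical host names and shard count; the slides do not state either.
# 9312 is the default searchd port.
SPHINX_INSTANCES = {
    0: "sphinx-shard0.example:9312",
    1: "sphinx-shard1.example:9312",
    2: "sphinx-shard2.example:9312",
}


def sphinx_instance_for(user_id):
    """Pick the Sphinx instance that indexes the database shard holding `user_id`.

    Assumes shards are assigned by `user_id % shard_count`, which is an
    illustrative rule, not the mapping Sklik used.
    """
    shard = user_id % len(SPHINX_INSTANCES)
    return SPHINX_INSTANCES[shard]


print(sphinx_instance_for(43932))
```

With this design every client has to carry the shard map, which is the kind of application-side routing that a distributed search cluster takes over.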


We considered: Apache Solr and Elasticsearch

• Both are open source, high-performance, full-featured text search engines (or even databases)

• Both have a distributed version

• Both built on Apache Lucene (and extend it)

• Both very popular all over the world

• Elasticsearch is probably better known and more popular in the Czech Republic


Apache Solr: Brief Introduction

Apache Solr – introduction

• Solr is an open source enterprise search server

• Solr was created by Yonik Seeley in 2004

• Current version is 5.5

• Uses the Lucene library and extends it

• Provides an HTTP interface (XML, JSON, CSV, binary)

• Since 2012, Solr has had a distributed version, SolrCloud (Hadoop integration)


Apache Solr – key features

• Advanced full-‐text search capabilities

• Optimized for high volume web traffic

• Batch full and delta indexing, near real-time updating (streaming via Apache Flume/Kafka, soft commits)

• Adaptable with XML configuration

• Extensible plugin architecture

• Linearly scalable, auto index replication (Hadoop integration)

• Comprehensive web administration interface, statistics …

• A lot of specialized queries: faceted search, ordering, grouping, pseudo-join, spatial search, functions …

Apache Solr - architecture


Source: Jan Hoydahl, "Migrating Fast to Solr" presentation, published on Mar 5, 2010, http://www.slideshare.net/janhoy/migrating-fast-to-solr

Apache Solr – architecture from data flow point of view

Indexing: MySQL, HBase, Flume -> Import/Update (JSON, XML, CSV, …) -> Analyzer, Tokenizer, Filter -> Index writer -> Index

Searching: Query parser and analyzer -> Index searcher -> Index



Apache Solr – Data model and hierarchy

Solr Instance
  Core/Index, Core/Index, Core/Index
    Documents
      Field, Field, Field

Indexing & querying configuration: solr.xml, solrconfig.xml, schema.xml

Apache Solr – schema.xml and fields definition

<schema name="Seznam Sklik Campaigns" version="1.5">
  ...
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="bool" class="solr.BoolField" sortMissingLast="true" />

  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text" indexed="true" stored="true" required="true" />
    <field name="userId" type="int" indexed="true" stored="false" required="true" />
    <field name="nameSimple" type="simpleText" indexed="true" stored="false" required="false"/>
    <copyField source="name" dest="nameSimple" />
  </fields>
  <uniqueKey>id</uniqueKey>

  ...

Apache Solr – Core overview via Admin


Apache Solr – Indexing and updating

• Configuration in solrconfig.xml, schema.xml and data-config.xml

• Request Handlers, Update Handlers, Update Processor Chain, Data Import Handler

• Index operations: add, delete, optimize, commit, rollback …

• Atomic updates – auto commit, soft and hard commit, transaction log for recovery scenarios

• Near real-time indexing, batch (full and delta) indexing
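A minimal sketch of a near real-time update through Solr's JSON update handler, using the `softCommit` request parameter. The host, collection name and document values are assumptions for illustration; the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, collection, docs, soft_commit=True):
    """Build (but do not send) a POST to the /update handler with a JSON body."""
    # softCommit makes the documents searchable quickly; commit forces a hard commit.
    params = {"softCommit": "true"} if soft_commit else {"commit": "true"}
    url = f"{base_url}/{collection}/update?{urllib.parse.urlencode(params)}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, collection and document values.
request = build_update_request(
    "http://localhost:8983/solr",
    "test_collection",
    [{"id": "43932!99192", "name": "Black bulldogs", "userId": 43932}],
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

Batch imports would normally end with a hard commit instead, so that the index is flushed to stable storage.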


• Update Handlers (XML, CSV, JSON, (PDF, Word, ...)): documents such as <doc><title> XML or a PDF arrive by HTTP POST

• Data Import Handler (database pull, RSS pull, simple transformation): PULLs from MySQL or an RSS feed

• Update Processor Chain (per handler) -> Lucene index

Apache Solr – data-config.xml and MySQL


...
<dataSource name="node1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" batchSize="-1"
            url="jdbc:mysql://skdb012.ng.seznam.cz/sklik_node" user="sklik_ro" password="…"/>
…
<dataSource name="node12" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" batchSize="-1"
            url="jdbc:mysql://skdb053.ng.seznam.cz/sklik_node" user="sklik_ro" password="…"/>
…

<document name="keywords">
  <entity name="keyword" dataSource="node1" query="
    SELECT CONCAT_WS('!', c.user_id, k.id) AS id,
           CAST(kl.name AS CHAR(255)) AS name,
           cu.url,
           g.id AS groupId,
           c.id AS campaignId,
           c.user_id AS userId
    FROM keyword k
      JOIN `group` g ON g.id = k.group_id
      JOIN campaign c ON c.id = g.campaign_id
      JOIN user u ON u.id = c.user_id
      JOIN sklik_common.keyword_lexicon kl ON k.keyword_lexicon_id = kl.id
      LEFT JOIN sklik_common.v_url cu ON k.url_id = cu.id
    WHERE u.serviced = 0
      AND ('${dih.request.user_ids}' = '' OR c.user_id IN (${dih.request.user_ids}))
      AND ('${dih.request.from}' = '' OR k.id >= '${dih.request.from}')
      AND ('${dih.request.to}' = '' OR k.id &lt; '${dih.request.to}')
      AND ('${dih.request.from_timestamp}' = '' OR k.index_date >= '${dih.request.from_timestamp}')
  " />

...
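The `${dih.request.…}` placeholders in the query are filled from the HTTP parameters of the import request, so one configuration can serve both full imports and partial ones. A minimal sketch of building such a request follows. The core URL is hypothetical, and the `/dataimport` path is the conventional handler name, assumed here rather than taken from the slides.

```python
import urllib.parse


def build_import_url(core_url, user_ids=(), id_from="", id_to="", from_timestamp=""):
    """Build a Data Import Handler full-import URL with the custom request parameters.

    Parameters left empty make the matching "'…' = ''" branch of the WHERE
    clause true, so that filter is skipped.
    """
    params = {
        "command": "full-import",
        "clean": "false",  # keep documents that are outside the imported slice
        "commit": "true",
        "user_ids": ",".join(str(u) for u in user_ids),
        "from": id_from,
        "to": id_to,
        "from_timestamp": from_timestamp,
    }
    return f"{core_url}/dataimport?{urllib.parse.urlencode(params)}"


# Hypothetical core name and ID range.
url = build_import_url(
    "http://localhost:8983/solr/keywords",
    user_ids=[43932],
    id_from="1000",
    id_to="2000",
)
print(url)
```

Splitting the keyword ID space into `from`/`to` slices like this is one way to run several imports in parallel or to re-index only a single account.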

Apache Solr – data-config.xml via Admin

Apache Solr – Import via Admin

Apache Solr – schema.xml and indexing


...
<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="simpleText" class="solr.TextField" sortMissingLast="true">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="foldToASCII.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="cz.seznam.sklik.solrconf.ModerateNGramFilterFactory" minGramSize="2" maxGramSize="512"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

...
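The asymmetry between the two analyzers is what makes infix search cheap at query time: n-grams are produced only when indexing, and the query term is looked up as-is. The sketch below imitates that chain in plain Python. It is an approximation under stated assumptions: Unicode decomposition stands in for the foldToASCII mapping file, a plain n-gram expansion stands in for the custom ModerateNGramFilterFactory (whose exact behaviour is not described in the slides), and all query tokens are required to match.

```python
import unicodedata


def fold_to_ascii(text):
    """Strip diacritics (a rough stand-in for the foldToASCII mapping file)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(token, min_size=2, max_size=512):
    """All substrings of `token` whose length is between min_size and max_size."""
    top = min(max_size, len(token))
    return {
        token[i:i + size]
        for size in range(min_size, top + 1)
        for i in range(len(token) - size + 1)
    }


def index_terms(value):
    """Index-time chain: fold, lowercase, split on whitespace, expand to n-grams."""
    terms = set()
    for token in fold_to_ascii(value).lower().split():
        terms |= ngrams(token)
    return terms


def query_terms(value):
    """Query-time chain: fold, lowercase, split on whitespace (no n-grams)."""
    return fold_to_ascii(value).lower().split()


def matches(indexed_value, query):
    """Assumed AND semantics: every query token must be one of the indexed terms."""
    terms = index_terms(indexed_value)
    return all(token in terms for token in query_terms(query))


print(matches("Black bulldogs", "DOG"))
```

The price of this design is index size, since every token is stored once per n-gram, which fits the index growth described for the keyword data.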

Apache Solr – Import stats via Admin



Apache Solr – schema.xml and searching


...
<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="simpleText" class="solr.TextField" sortMissingLast="true">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="foldToASCII.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="cz.seznam.sklik.solrconf.ModerateNGramFilterFactory" minGramSize="2" maxGramSize="512"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
...

Apache Solr – querying


http://sksolr2.ng.seznam.cz:8983/solr/test_collection/select?q=name:*dog*&start=10&rows=1&wt=xml

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">510</int>
    <lst name="params">
      <str name="q">name:*dog*</str>
      <str name="indent">true</str>
      <str name="start">10</str>
      <str name="rows">1</str>
      <str name="wt">xml</str>
      <str name="_">1458818597860</str>
    </lst>
  </lst>
  <result name="response" numFound="82" start="10" maxScore="1.0">
    <doc>
      <str name="name">Black bulldogs</str>
      <str name="id">43932!99192</str>
      <long name="_version_">1529083809145815041</long>
    </doc>
  </result>
</response>
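A client would typically build the select URL from parameters and read `numFound` and the documents out of the XML. A minimal sketch using only the Python standard library follows. The localhost URL and the helper names are assumptions, and the sample text is a trimmed copy of the response shown above.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def build_select_url(collection_url, query, start=0, rows=10):
    """Build a /select URL; urlencode escapes ':' and '*' in the query."""
    params = {"q": query, "start": start, "rows": rows, "wt": "xml"}
    return f"{collection_url}/select?{urllib.parse.urlencode(params)}"


def parse_response(xml_text):
    """Return (numFound, docs); every field value is kept as a string."""
    root = ET.fromstring(xml_text)
    result = root.find("result[@name='response']")
    docs = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


SAMPLE = """<response>
  <result name="response" numFound="82" start="10" maxScore="1.0">
    <doc>
      <str name="name">Black bulldogs</str>
      <str name="id">43932!99192</str>
      <long name="_version_">1529083809145815041</long>
    </doc>
  </result>
</response>"""

num_found, docs = parse_response(SAMPLE)
print(num_found, docs[0]["name"])
```

`numFound` counts all matching documents, while `start` and `rows` only select the page that is returned.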

Apache Solr – querying


../select?q=keyword:hire~0.7&fq=avgCpc:[600+TO+*]+OR+competition:[0.6+TO+*]^7&sort=sum(count,competition)+desc&start=5&wt=json&indent=true

{
  "responseHeader": {
    "status": 0,
    "QTime": 1493,
    "params": {
      "q": "keyword:hire~0.7",
      "indent": "true",
      "start": "5",
      "fq": "avgCpc:[600 TO *] OR competition:[0.6 TO *]^7",
      "sort": "sum(count, competition) desc",
      "wt": "json"}},
  "response": {"numFound": 21, "start": 5, "docs": [
    {
      "query": "fire",
      "count": 108,
      "competition": 0.62176,
      "avgCpc": 589.0909,
      "months": ["2015-03", "2015-04", "2015-05", …
…

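What the `fq` and `sort` parameters of this request compute can be reproduced locally on a few rows. In the sketch below the first row is the document from the response above; the other two rows are hypothetical and exist only to show a document being filtered out and the ordering. The boost (`^7`) is left out because a filter query only decides which documents pass and does not contribute to scoring.

```python
def passes_filter(doc):
    """fq=avgCpc:[600 TO *] OR competition:[0.6 TO *] (square brackets are inclusive)."""
    return doc["avgCpc"] >= 600 or doc["competition"] >= 0.6


def sort_key(doc):
    """sort=sum(count,competition) desc"""
    return doc["count"] + doc["competition"]


docs = [
    # Document from the response above: fails the avgCpc clause, passes on competition.
    {"query": "fire", "count": 108, "competition": 0.62176, "avgCpc": 589.0909},
    # Hypothetical rows for illustration.
    {"query": "hire", "count": 40, "competition": 0.10, "avgCpc": 650.0},
    {"query": "wire", "count": 500, "competition": 0.20, "avgCpc": 120.0},
]

ranked = sorted(filter(passes_filter, docs), key=sort_key, reverse=True)
print([doc["query"] for doc in ranked])
```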
Apache Solr – querying via Admin


Apache Solr – querying stats via Admin


SolrCloud: Brief Introduction

SolrCloud – introduction and architecture

• Distributed, auto index replication, linearly scalable

• Hadoop and HDFS integration

• “Roughly” a CP system (with good availability), fault tolerant (HA + no single point of failure)

• Documents are routed by hashing the document ID to an integer (or by custom hashing); each shard covers a hash range

• All nodes in the cluster perform indexing and execute queries; no master node

• Terminology: ZooKeeper, Node, Collection, Replication Factor, Shard, Replica, Leader
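Hash-range routing can be sketched in a few lines: the 32-bit hash space is cut into one contiguous range per shard, and a document goes to the shard whose range contains the hash of its ID. This is an illustration only. `zlib.crc32` is used as a stand-in because it ships with Python, it is not the hash function Solr uses, and the sketch hashes the whole ID, whereas IDs with a `!` separator such as `43932!99192` may be routed by prefix when Solr's compositeId router is in use.

```python
import zlib

HASH_SPACE = 2 ** 32


def shard_ranges(num_shards):
    """Split the 32-bit hash space into `num_shards` contiguous inclusive ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        low = i * step
        # The last shard absorbs the remainder so the whole space is covered.
        high = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((low, high))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose hash range contains hash(doc_id)."""
    value = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, 0 .. 2**32 - 1
    for index, (low, high) in enumerate(ranges):
        if low <= value <= high:
            return index
    raise ValueError("hash value outside all shard ranges")


ranges = shard_ranges(2)
print(ranges, route("43932!99192", ranges))
```

Because the mapping depends only on the ID, any node can compute the target shard and forward the document to that shard's leader, which is why no master node is needed for routing.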


[Diagram: a SolrCloud cluster behind a balancer. Server 1 and Server 2 run four nodes (Node 1 and Node 3 on port 8984, Node 2 and Node 4 on port 8985); each node is a Java VM running Jetty with the Solr web app, backed by HDFS. The collection is sharded into shard1 and shard2, each with a Leader and a Replica kept in sync by replication; ZooKeeper handles leader election.]

SolrCloud – cloud via Admin


SolrCloud – in Seznam.cz

• Two clusters (24 and 8 machines, plus 4 backup machines), TBs of indexes. We use Solr:

• as a full-text search tool for filters on our client’s website

• as a keyword proposal tool (with stats) that supports creating and tuning customers’ advertising

• as storage for queries and their stats (publicly accessible via website and API) for our search engine

• We are generally satisfied. We are still fighting with optimal data scaling and query performance, but indexing and availability are very good


That is all! Questions?

Thank you for listening!

Thursday 12:50 PM @ Ballroom F: MySQL and Impala ecosystem
