Solr
Hao Chen, 2012.04
What is Solr?
Solr is the popular open source enterprise search platform from the Apache Lucene project.
Solr powers the search and navigation features of many of the world's largest internet sites.
Lucene
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Lucene vs Solr
Lucene is a search library built in Java. Solr is a web application built on top of Lucene. In short, Solr = Lucene + added features. A common question is when to choose Solr and when to choose Lucene.
For more control, use Lucene. For faster development and an easier learning curve, choose Solr.
http://www.findbestopensource.com/article-detail/lucene-vs-solr
Why do we need Solr?
Full-text Search
– MySQL "LIKE '%keyword%'"
Too slow (a leading-wildcard LIKE cannot use an ordinary index, so every row is scanned) and weak (no tokenization or relevance ranking)!
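To make the contrast concrete, here is a minimal sketch (not Solr or Lucene code) of the two lookup styles: a substring scan that behaves like `LIKE '%keyword%'`, and a toy inverted index of the kind a full-text engine builds ahead of time. The sample rows and function names are hypothetical, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import defaultdict

# Hypothetical sample rows; in practice these would come from the database.
ROWS = {
    1: "Apple iPad 2 tablet 16GB",
    2: "Leather case for iPad",
    3: "Android phone charger",
}

def like_scan(rows, keyword):
    """Emulate LIKE '%keyword%': test every row, so cost grows with table size."""
    kw = keyword.lower()
    return sorted(doc_id for doc_id, text in rows.items() if kw in text.lower())

def build_inverted_index(rows):
    """Map each lowercased, whitespace-separated token to the row ids containing it."""
    index = defaultdict(set)
    for doc_id, text in rows.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def index_lookup(index, keyword):
    """Answer a single-term query with one dictionary lookup."""
    return sorted(index.get(keyword.lower(), set()))

INDEX = build_inverted_index(ROWS)
print(like_scan(ROWS, "ipad"))      # [1, 2]
print(index_lookup(INDEX, "ipad"))  # [1, 2]
```

Both calls return the same rows here, but the scan touches every row on every query, while the index pays its cost once at indexing time.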
Major Features of Solr
– Advanced Full-Text Search Capabilities
– Optimized for High Volume Web Traffic
– Standards Based Open Interfaces: XML, JSON and HTTP
– Comprehensive HTML Administration Interfaces
– Server statistics exposed over JMX for monitoring
– Scalability: Efficient Replication to other Solr Search Servers
– Flexible and Adaptable with XML configuration
– Extensible Plugin Architecture
http://lucene.apache.org/solr/
Typical Application Architecture
[Diagram. Labels: http request, Web Server, Database (MySQL), Cache (memcached, Redis, etc.), Solr / Lucene, DIH]
All the components can be distributed to make the architecture scalable.
Lucene/Solr Architecture
[Diagram. Recoverable labels:
– Request Handlers: /select, /spell, /admin
– Update Handlers: XML, CSV, binary
– Response Writers: XML, Binary, JSON
– Search Components: Query, Spelling, Faceting, Highlighting, Debug, Statistics, More like this, Distributed Search, Clustering
– Update Processors: Signature, Logging, Indexing
– Data Import Handler (SQL/RSS); Extracting Request Handler (PDF/WORD) with Apache Tika
– Solr core: Schema, Config, Query Parsing, Caching, Faceting, Filtering, Search, High-lighting, Index Replication, Analysis
– Apache Lucene: Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis]
Demo – A live website powered by Solr
I’ll be showing you more later!
Demo – The backend of the website
Demo – Standard directory layout
Demo – Multiple cores
Demo – Run Solr!
java -jar start.jar
Production environment:
– java -Xms200m -Xmx1400m -jar start.jar >>/home/web_logs/solr/solr$date.log 2>&1 &
– tailf /home/web_logs/solr/solr20120423.log
2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
Demo – Web Admin Interface
http://localhost:8983/solr/admin
• SCHEMA: This downloads the schema configuration file (XML) directly to the browser.
• CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.
• ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen.
• SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index.
• STATISTICS: Here you will find stats such as timing and cache hit ratios. This screen is useful for evaluating Solr's performance.
• INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful.
• DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations.
• PING: Ignore this, although it can be used for a health-check in distributed mode.
• LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.
Query
Indexing
Query
INFO: [core1] webapp=/solr path=/admin/ping params={} status=0 QTime=2
Apr 23, 2012 5:42:46 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/solr path=/select params={wt=json&rows=100&json.nl=map&start=0&q=searchKeyword:ipad2} hits=48 status=0 QTime=0
Query
INFO: [] webapp=/solr path=/select params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41
q=CId:50011744+AND+price:[100+TO+*]
sort=volume+desc
start=0
rows=20
hits=1547 status=0 QTime=41
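Request log lines like the ones above are easy to take apart for monitoring. The following is a rough sketch, assuming the line layout stays exactly as shown on the slide. The regular expression and the function name are my own, not part of Solr, and QTime is taken to be in milliseconds.

```python
import re
from urllib.parse import parse_qs

# Log line copied from the slide above.
LOG_LINE = ("INFO: [] webapp=/solr path=/select "
            "params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc"
            "&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41")

# Hypothetical pattern for search requests; ping lines have no hits= and do not match.
REQUEST_RE = re.compile(
    r"path=(?P<path>\S+) params=\{(?P<params>.*)\} "
    r"hits=(?P<hits>\d+) status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)

def parse_request_log(line):
    """Return path, decoded params, hits, status and QTime, or None if no match."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    # parse_qs decodes '+' to a space and returns a list per key; keep the first value.
    params = {key: values[0]
              for key, values in parse_qs(match.group("params")).items()}
    return {
        "path": match.group("path"),
        "params": params,
        "hits": int(match.group("hits")),
        "status": int(match.group("status")),
        "qtime_ms": int(match.group("qtime")),
    }

entry = parse_request_log(LOG_LINE)
print(entry["params"]["q"])              # CId:50011744 AND price:[100 TO *]
print(entry["hits"], entry["qtime_ms"])  # 1547 41
```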
Query
– q: the query string. Required.
– fl: the fields to return. Separate multiple fields with commas or spaces.
– start: the offset of the first returned record within the full result set, starting at 0. Generally used for paging.
– rows: the maximum number of records to return. Combine with start to implement paging.
– sort: the sort order, in the format sort=<field name>+<desc|asc>[,<field name>+<desc|asc>]… For example, (inStock desc, price asc) sorts by "inStock" descending, then by "price" ascending. The default is relevance, descending.
– wt: (writer type) the output format. Possible values are xml, json, php, phps; the latter were added in Solr 1.3. Let us know if you need them, because they are not enabled by default.
– fq: (filter query) restricts the results of q to those that also match fq. For example, q=mm&fq=date_time:[20081001 TO 20091031] finds the keyword mm where date_time is between 20081001 and 20091031.
More: http://wiki.apache.org/solr/CommonQueryParameters
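As an illustration of how these parameters combine into one request, here is a small sketch that builds a `/select` URL with the Python standard library. The base URL is the demo's local instance. The helper name and its defaults are assumptions for this example, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_BASE = "http://localhost:8983/solr"  # assumption: the local demo instance

def build_select_url(q, fq=None, fl=None, sort=None, start=0, rows=10, wt="json"):
    """Assemble a /select URL from the common query parameters."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))      # filter query, applied on top of q
    if fl:
        params.append(("fl", fl))      # comma-separated list of fields to return
    if sort:
        params.append(("sort", sort))  # e.g. "volume desc"
    params += [("start", start), ("rows", rows), ("wt", wt)]
    # urlencode escapes spaces, brackets and colons so the URL is safe to send.
    return f"{SOLR_BASE}/select?{urlencode(params)}"

# Second page (20 rows per page) of the query from the log slide.
url = build_select_url(
    q="CId:50011744 AND price:[100 TO *]",
    sort="volume desc",
    start=20,
    rows=20,
)
print(url)

# Round-trip check: decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0], "|", decoded["start"][0])
```

Paging is only arithmetic on `start`: page n (counting from 0) with a page size of `rows` uses `start = n * rows`.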
Demo – PHP Solr Client
Query - Demo
Indexing Data
Indexing Data - Communicating with Solr
– Direct HTTP or a convenient client API
– Data streamed remotely or from Solr's filesystem
Indexing Data - Data formats/sources
– Solr-XML:
– Solr-binary: This is only supported by the SolrJ client API.
– CSV: CSV is a character-separated value format (the separator is often a comma).
– Rich documents like PDF, XLS, DOC, PPT
– Solr's DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example: web services). It supports configurable relational and schema mapping options and supports custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.
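For the Solr-XML format, the update message is an `<add>` element wrapping one `<doc>` per document, each with named `<field>` children. A minimal sketch of producing such a message follows. The field names `id` and `title` are hypothetical and would have to match the fields declared in schema.xml; the function only builds the payload and does not send it.

```python
import xml.etree.ElementTree as ET

def to_solr_add_xml(docs):
    """Serialize a list of dicts as a Solr-XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")

# Hypothetical document; field names must exist in schema.xml.
payload = to_solr_add_xml([{"id": "1", "title": "iPad 2 & case"}])
print(payload)
# <add><doc><field name="id">1</field><field name="title">iPad 2 &amp; case</field></doc></add>
```

The payload would then be sent with an HTTP POST to the XML update handler, followed by a `<commit/>` message so that the new documents become searchable.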
Lucene/Solr Indexing
[Diagram. Recoverable labels:
– HTTP POST to /update (XML Update Handler), /update/csv (CSV Update Handler), /update/xml (XML Update with custom processor chain), /update/extract (Extracting RequestHandler: PDF, Word, …)
– Data Import Handler: database pull (SQL DB), RSS pull (RSS feed), simple transforms
– Update Processor Chain (per handler): Remove Duplicates processor, Logging processor, Custom Transform processor, Index processor
– Lucene: Text Index Analyzers (schema.xml), Lucene Index]
Indexing Data - Schema
Advanced
– Chinese Word Segmentation (中文分词)
– DIH (Data Import Handler)
– Sharding
– Replication
– Performance Tuning
Chinese Word Segmentation (中文分词)
IKAnalyzer3.2.8.jar
For the underlying principles, see the book 《解密搜索引擎技术实战》.
DIH (Data Import Handler)
[Diagram. Labels: MySQL, jdbc/DIH, Solr]
• full-import
• delta-import
Most applications store data in relational databases or XML files, and searching over such data is a common use case.
The DataImportHandler is a Solr contrib that provides a configuration-driven way to import this data into Solr, both as "full builds" and as incremental delta imports.
DIH (Data Import Handler)
1. Imports data from databases through JDBC (Java Database Connectivity)
2. Imports XML data from a URL (HTTP GET) or a file
3. Can combine data from different tables or sources in various ways
4. Extraction/Transformation of the data
5. Import of updated (delta) data from a database, assuming a last-updated date
6. A diagnostic/development web page
7. Extensible to support alternative data sources and transformation steps
DIH (Data Import Handler)
• curl http://localhost:8983/solr/dataimport to verify the configuration.
• curl http://localhost:8983/solr/dataimport?command=full-import
• curl http://localhost:8983/solr/dataimport?command=delta-import
DIH (Data Import Handler) - Full Import Example (full index)
data-config.xml
DIH (Data Import Handler) - Delta Import Example (incremental index)
data-config.xml
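The slides' own data-config.xml did not survive in this transcript, so the following is only a sketch of what such a file commonly looks like for a MySQL source, not the presenter's configuration. The database name, credentials, table and column names are all hypothetical. The XML is held in a Python string so that it can at least be checked for well-formedness.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-config.xml: database, credentials, table and columns are made up.
# "query" drives full-import; "deltaQuery" finds the ids changed since the last run,
# and "deltaImportQuery" reloads each changed row during delta-import.
DATA_CONFIG = """\
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop"
              user="solr" password="secret"/>
  <document>
    <entity name="item" pk="id"
            query="SELECT id, title, price FROM item"
            deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, price FROM item WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>
"""

# Parse the config to confirm it is well-formed and show the three queries.
root = ET.fromstring(DATA_CONFIG)
entity = root.find("document/entity")
for attr in ("query", "deltaQuery", "deltaImportQuery"):
    print(f"{attr}: {entity.get(attr)}")
```

The delta queries rely on the source table carrying a last-modified timestamp (here the hypothetical `last_modified` column), which matches the "assuming a last-updated date" condition listed among the DIH features.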
DIH (Data Import Handler) - Demo
2 million rows imported in about 20 minutes.
Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
cpu cores : 1
MemTotal: 2058400 kB
Sharding
Sharding is the process of breaking a single logical index in a horizontal fashion across records versus breaking it up vertically by entities.
[Diagram: shards S1, S2, S3, S4]
Sharding-Indexing
SHARDS = ['http://server1:8983/solr/', 'http://server2:8983/solr/']
unique_id = document[:id]
# Each indexing thread handles only the documents whose id hashes to its own shard number.
if unique_id.hash % SHARDS.size == local_thread_id
  # index to shard
end
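The same routing idea can be sketched in Python. Note one deliberate difference from the snippet above: instead of each thread testing whether a document belongs to it, this version returns the target shard URL directly. It also uses CRC32 rather than the built-in `hash()`, because Python randomizes string hashes per process and a document must always land on the same shard. The sample ids are hypothetical.

```python
import zlib

# Shard URLs as given on the slide.
SHARDS = ["http://server1:8983/solr/", "http://server2:8983/solr/"]

def shard_for(unique_id, shards=SHARDS):
    """Pick a shard from a stable hash of the document id."""
    digest = zlib.crc32(str(unique_id).encode("utf-8"))
    return shards[digest % len(shards)]

# Hypothetical document ids.
for doc_id in ["50011744", "50011745", "ipad2"]:
    print(doc_id, "->", shard_for(doc_id))
```

With simple modulo routing, changing the number of shards changes the target of most ids, so adding a shard generally means reindexing.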
Sharding-Query
The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it.
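Although no special configuration is needed, the request itself still has to say which shards to search, via the `shards` parameter (a comma-separated list of host:port/path entries without the URL scheme). A minimal sketch, reusing the two servers from the indexing slide; the helper name is hypothetical and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The two servers from the indexing slide, in the scheme-less form "shards" expects.
SHARD_HOSTS = ["server1:8983/solr", "server2:8983/solr"]

def build_distributed_url(entry_point, q, rows=10):
    """Build a /select URL that fans the query out to every shard."""
    params = {
        "q": q,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(SHARD_HOSTS),
    }
    return f"{entry_point}/select?{urlencode(params)}"

# Any shard can act as the entry point and merge the partial results.
url = build_distributed_url("http://server1:8983/solr", "searchKeyword:ipad2")
print(url)
print(parse_qs(urlsplit(url).query)["shards"][0])
# server1:8983/solr,server2:8983/solr
```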
Replication
[Diagram: a Master replicating to Slaves]
Combining replication and sharding
[Diagram: Sharding Masters M1, M2, M3; replication to Slave Pool 1 (S1, S2, S3) and Slave Pool 2 (S1, S2, S3); queries sent to pools of slave shards]
Combining replication and sharding
http://wiki.apache.org/solr/SolrCloud
http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html
Performance Tuning
– JVM
– HTTP cache
– Solr cache
– Better schema
– Better indexing strategy
Solr Caching
Caching is a key part of what makes Solr fast and scalable.
There are a number of different caches configured in solrconfig.xml:
– filterCache
– queryResultCache
– documentCache
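To show why a cache such as filterCache pays off, here is a conceptual sketch of a size-bounded least-recently-used cache keyed by the filter query string. It is not Solr's implementation (Solr ships its own cache classes, configured in solrconfig.xml). The class, the stand-in `compute` function and the document ids are all hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, size):
        self.size = size
        self._data = OrderedDict()
        self.lookups = 0
        self.hits = 0

    def get(self, key):
        self.lookups += 1
        if key not in self._data:
            return None
        self.hits += 1
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.size:
            self._data.popitem(last=False)  # drop the oldest entry

    @property
    def hit_ratio(self):
        return self.hits / self.lookups if self.lookups else 0.0

filter_cache = LRUCache(size=2)

def matching_docs(fq, compute):
    """Return the doc-id set for a filter query, computing it only on a miss."""
    cached = filter_cache.get(fq)
    if cached is None:
        cached = compute(fq)
        filter_cache.put(fq, cached)
    return cached

# Hypothetical stand-in for the expensive index scan behind a filter query.
def compute(fq):
    return frozenset({1, 2, 3})

matching_docs("price:[100 TO *]", compute)  # miss: computed and stored
matching_docs("price:[100 TO *]", compute)  # hit: served from the cache
print(filter_cache.hit_ratio)  # 0.5
```

The hit ratio computed here is the same kind of figure the STATISTICS screen of the admin interface reports for each cache.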
More Info
– 《Solr 1.4 Enterprise Search Server》
– http://wiki.apache.org/solr/
– http://solr.pl/en/
– 《解密搜索引擎技术实战》
Thank you!