This is an intro to Sphinx and PHP. It will take you through the very basics of how Sphinx works, how you can set up an index, and how to use the mysql client to search your index. Then it culminates in a quick little PHP script that builds a small search interface around your index. I will be posting the example code to my GitHub account soon. This presentation was given to the LV PHP meetup on August 5th.
Using Sphinx for Search
Mike Lively
Slickdeals, LLC
What is Sphinx?
• A full-text search engine
• Quickly get high quality (relevant) results
• Designed to integrate well with SQL RDBMS
• Can work with any data source
• Can be queried using either an API or SQL
How do I know anything about Sphinx?
• Manager of Software Architecture for Slickdeals.net
• Alexa top 150 site (in the US)
• Have been working on improving our Sphinx search engine for the last 2 months or so.
• Over 7 million searches a month go directly through the interface; lots more happen indirectly.
When should I use Sphinx?
• Site / Product / Document searches
• Auto-suggest / Auto-Correct functionality
• Finding relevant and related items
Simple Architecture
• Often, search is offloaded straight to the database
• Search goes to the backend which performs queries on the database
• Obviously very easy to implement
Simple Architecture
• Simple “starts with” searches on indexed fields can sometimes work: `city` LIKE 'Las%'
• Anything else will lock the table for writes while the query runs if you are on MyISAM.
• MySQL is not a great or flexible full text engine
• It can sometimes be adequate
Sphinx Architecture
• Searchd is responsible for receiving requests from clients and executing the searches against the sphinx index.
• Indexer is responsible for getting data into the sphinx index.
• This separation allows indexing and searching to be scaled separately.
Sphinx Architecture
• Searchd has a binary protocol for which there are several clients available in multiple languages.
• Searchd also speaks MySQL’s wire protocol (the version used since MySQL 4.1), so ordinary MySQL clients can connect to it
• Searchd is a daemon that runs on your search servers
Sphinx Architecture
• Indexer is a shell program that you can execute to build any number of indexes.
• Can handle index rotation for live indexing
Not So Quick Side Note
MySQL IS SLOWWWWWWWWWWWWW (at text matches)
Still Not Quick Side Note
Indexes won’t help you…
Quicker Side Note
Full Text Search isn’t so bad IF….
Sphinx Concepts
• Sphinx Indexes “Documents”
• Each document has a unique unsigned, non-zero integer ID (either 32 bit or 64 bit space)
• Each document has one or more fields
• Each document has zero or more attributes
Indexes / Sources
• Sphinx indexes are created from one or more sources.
• The source can be a database, XML, or TSV stream.
• You can use multiple sources
• This is useful for maintaining updated indexes
• Also used to implement a sphinx cluster
Sphinx Fields
• Fields are what the full text index is built from.
• When searching you can search against any number of fields.
• You can assign different relevancy weights to different fields.
• The original value of a field is never stored by Sphinx.
• You should always have at least one.
Sphinx Attributes
• Data that helps further describe the item being indexed
• Can be returned as a part of the search
• Useful for filtering and sorting results
• These are not a part of the full text index.
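The document, field, and attribute concepts above can be sketched with a toy in-memory index. This is only an illustration and not how Sphinx is implemented; the `ToyIndex` class, its methods, and the sample data are hypothetical, and Python is used just to keep the sketch self-contained.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of documents, fields, and attributes (not Sphinx itself)."""

    def __init__(self):
        # (field name, word) -> set of document IDs containing that word
        self.postings = defaultdict(set)
        # document ID -> attribute values kept for filtering and sorting
        self.attributes = {}

    def add(self, doc_id, fields, attributes=None):
        if not isinstance(doc_id, int) or doc_id <= 0:
            raise ValueError("document ID must be a non-zero unsigned integer")
        if doc_id in self.attributes:
            raise ValueError("document ID must be unique")
        if not fields:
            raise ValueError("a document needs at least one field")
        for name, text in fields.items():
            for word in text.lower().split():
                self.postings[(name, word)].add(doc_id)
        # The field text is discarded after indexing; only attributes are kept.
        self.attributes[doc_id] = dict(attributes or {})

    def search(self, word, field=None):
        """Return sorted IDs of documents containing the word."""
        word = word.lower()
        ids = set()
        for (name, indexed_word), doc_ids in self.postings.items():
            if indexed_word == word and (field is None or name == field):
                ids |= doc_ids
        return sorted(ids)


# Hypothetical sample documents
index = ToyIndex()
index.add(1, {"title": "Unit testing in PHP"}, {"created": 1375660800})
index.add(2, {"title": "Sphinx search"}, {"created": 1375747200})
matches = index.search("testing", field="title")
```

Here `matches` holds only document IDs. The attribute values can be looked up by ID, but the original title text cannot be recovered from the index, which mirrors the point that field values are not stored.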
MySQL Full Text Search
• You can get away with MyISAM tables or, as of MySQL 5.6, InnoDB.
• You don’t care about morphology (think plurals)
• You don’t need anything but the most basic of search operators
Creating An Index
• We are going to add an index that sources a mysql database.
• The data being sourced is a list of the titles of Wikipedia posts.
Indexer Configuration
• We are going to be peeking into a sphinx configuration file now.
• You can rebuild the config file by concatenating each section into a single file.
• On my VM this file is located in /usr/local/etc/sphinx.conf
Source Definition
Defines the connection information
Connection information
• Ideally, you should create a separate account for sphinx
• You can also connect via unix socket
• I didn’t specify it here, but you can also add a port.
Source Definition
The query that pulls data to populate the index
Source Index
• The index query MUST return the id field as the first column
• Remember, the id needs to be a unique, unsigned integer of 64 bits or fewer
• The query must be on a single line unless you escape newlines with backslashes.
• Notice that we converted the timestamp into a unix timestamp. That is important: Sphinx stores timestamp attributes as integers.
Source Definition
How data is stored in the index
Source Fields
• The first column in the query is always the ID.
• You specify any columns that are attributes.
• Remember, attributes are values stored in the index that can be used to filter and sort.
• Any column besides the id that is not specified as an attribute is assumed to be a text field (title)
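As a rough illustration of the rules above, the sketch below splits one row returned by a source query into an ID, full-text fields, and attributes. The function name, column names, and sample row are hypothetical, and treating a naive datetime as UTC is an assumption made only for this example.

```python
from datetime import datetime, timezone

MAX_ID = 2**64 - 1  # largest unsigned 64-bit value


def split_row(columns, row, attribute_columns):
    """Split one source row into (doc_id, fields, attributes)."""
    doc_id = row[0]  # the first column is always the document ID
    if not isinstance(doc_id, int) or not 0 < doc_id <= MAX_ID:
        raise ValueError("ID must be a non-zero unsigned integer of at most 64 bits")
    fields, attributes = {}, {}
    for name, value in zip(columns[1:], row[1:]):
        if name in attribute_columns:
            if isinstance(value, datetime):
                # Assumption: naive datetimes are UTC; store them as unix timestamps.
                value = int(value.replace(tzinfo=timezone.utc).timestamp())
            attributes[name] = value
        else:
            # Anything not declared as an attribute is treated as a text field.
            fields[name] = value
    return doc_id, fields, attributes


# Hypothetical row: (id, title, created)
doc_id, fields, attributes = split_row(
    ["id", "title", "created"],
    (7, "Las Vegas", datetime(2013, 8, 5)),
    attribute_columns={"created"},
)
```

In a real source definition the timestamp conversion would normally happen in the SQL query itself, as the slide notes; the conversion is done in Python here only to make the integer representation visible.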
Index Definition
• An index includes one or more sources.
• Each source gets its own “source” line
• Multiple sources must all define the same fields and attributes.
• The ids need to be unique across sources
Index Definition
• path is not actually a path, it’s a filename with no extension.
• docinfo dictates if attributes are stored in the index or outside of the index.
• dict is not really important now. It used to be either crc or keywords, but crc is now deprecated.
• min_word_len is the minimum length of words to index
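The configuration shown on the original slides is not reproduced in this transcript. As a stand-in, here is a small sketch that renders a source section and an index section using the directives discussed above. The helper `render_section` and every name and value (hosts, credentials, table, paths) are hypothetical placeholders, not the talk's actual configuration.

```python
def render_section(kind, name, settings):
    """Render one sphinx.conf-style section from (key, value) pairs."""
    lines = [f"{kind} {name}", "{"]
    for key, value in settings:
        lines.append(f"    {key} = {value}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical source: connection info, the one-line query, and one attribute
source_conf = render_section("source", "wiki_titles_src", [
    ("type", "mysql"),
    ("sql_host", "localhost"),
    ("sql_user", "sphinx"),
    ("sql_pass", "secret"),
    ("sql_db", "wiki"),
    # ID first, then the text field, then the timestamp as a unix timestamp
    ("sql_query", "SELECT id, title, UNIX_TIMESTAMP(created) AS created FROM titles"),
    ("sql_attr_timestamp", "created"),
])

# Hypothetical index built from that source
index_conf = render_section("index", "wiki_titles", [
    ("source", "wiki_titles_src"),
    ("path", "/var/lib/sphinx/wiki_titles"),  # a filename without extension
    ("docinfo", "extern"),
    ("dict", "keywords"),
    ("min_word_len", "1"),
])

config_text = source_conf + "\n\n" + index_conf + "\n"
```

Settings are passed as a list of pairs rather than a dict so that an index with several sources can repeat the `source` key.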
Rest of the Index Configuration
It’s time to build the index
indexer <index name>
Searching the Index
• searchd is the daemon that searches the index
• Binary Protocol
OR
• MySQL Compatible too!
searchd config
Included in the same config file as the rest
Spinning up searchd
“I know MySQL”
– Sphinx
MySQL Compatible
• Tables == Indexes
• SHOW TABLES … shows indexes.
• SELECT * FROM <index> works too.
Selecting from an index
Querying Indexes
• Default limit of 20 rows
• Notice the text fields are not returned…
• They would be if we made them attributes (sql_field_string)
Querying Indexes
• The magic function in SphinxQL is match()
• match() performs a full text search against the entire index…usually
• The ‘@field’ operator can isolate which field is searched on.
Querying Indexes
• You can query against attributes
• You can sort results
• You can use the weight() function to determine relevancy.
Querying Indexes
• The title with id 25387283 was more relevant because it matched on the term “testing”
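The query slides themselves are not in this transcript, so here is a minimal sketch of how a SphinxQL statement combining match(), the ‘@field’ operator, and weight() might be assembled. The function `build_search` and the index name are hypothetical; the `%s` placeholders follow the style used by common Python MySQL drivers, and no connection is made here.

```python
def build_search(index, terms, field=None, limit=20):
    """Build a SphinxQL query string and its parameters (nothing is executed)."""
    # Index and field names are interpolated, so only plain identifiers are allowed.
    if not index.isidentifier():
        raise ValueError("index name must be a plain identifier")
    if field is not None and not field.isidentifier():
        raise ValueError("field name must be a plain identifier")
    # '@field terms' restricts the full-text match to a single field.
    match_expr = f"@{field} {terms}" if field else terms
    sql = (
        f"SELECT id, WEIGHT() AS relevance FROM {index} "
        "WHERE MATCH(%s) ORDER BY relevance DESC LIMIT %s"
    )
    return sql, (match_expr, limit)


# Hypothetical usage: search only the title field of an index named wiki_titles
sql, params = build_search("wiki_titles", "testing", field="title")
```

The default `limit` of 20 mirrors the default row limit mentioned earlier; passing the search terms as a parameter rather than interpolating them keeps user input out of the statement text.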
Getting PHP into the mix
• All we need? PDO.
• We will build a basic search page
• Accepts a query and displays up to 100 matching results, ordered by relevancy, with the matching keywords highlighted.
Pulling data from Sphinx
Fetching the data from MySQL
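Because Sphinx does not store the original field text, the search returns IDs and the titles have to come from MySQL. The talk's PHP code for this step is not in the transcript; the sketch below shows one way to do it under stated assumptions. The function names, the table name, and the sample rows are all hypothetical.

```python
def build_fetch(table, ids):
    """Build a MySQL query that loads the rows for the given document IDs."""
    if not table.isidentifier():
        raise ValueError("table name must be a plain identifier")
    if not ids:
        raise ValueError("need at least one ID")
    placeholders = ", ".join(["%s"] * len(ids))
    return f"SELECT id, title FROM {table} WHERE id IN ({placeholders})", tuple(ids)


def order_by_relevance(ranked_ids, rows):
    """Put database rows back into the order Sphinx ranked them."""
    by_id = {row["id"]: row for row in rows}
    # IDs missing from the database (for example, deleted rows) are skipped.
    return [by_id[doc_id] for doc_id in ranked_ids if doc_id in by_id]


# Hypothetical data: IDs as ranked by Sphinx, rows as MySQL happened to return them
ranked_ids = [25387283, 12, 7]
fetch_sql, fetch_params = build_fetch("titles", ranked_ids)
rows = [
    {"id": 7, "title": "Las Vegas"},
    {"id": 25387283, "title": "Software testing"},
]
ordered = order_by_relevance(ranked_ids, rows)
```

An `IN (...)` query does not guarantee row order, which is why the rows are reordered afterwards using the ranking that came back from the search.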
Adding the fancy yellow highlighting
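How the talk implemented the highlighting is not shown in this transcript. One simple client-side approach, sketched here as an assumption, is to wrap each query word found in the title in an HTML tag. The function `highlight` and the choice of the `mark` tag are hypothetical. This sketch matches exact words only, so it will miss variants that Sphinx matched through morphology; Sphinx also offers its own snippet facility (CALL SNIPPETS in SphinxQL) as an alternative.

```python
import html
import re


def highlight(title, query, tag="mark"):
    """Escape a title for HTML and wrap whole-word query matches in a tag."""
    words = re.findall(r"\w+", query)
    if not words:
        return html.escape(title)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(title):
        # Escape the text between matches so titles cannot inject markup.
        parts.append(html.escape(title[last:match.start()]))
        parts.append(f"<{tag}>{html.escape(match.group(0))}</{tag}>")
        last = match.end()
    parts.append(html.escape(title[last:]))
    return "".join(parts)


# Hypothetical title and query
highlighted = highlight("Unit testing <PHP>", "testing")
```

Escaping is applied piece by piece so that the highlight tags are the only markup that reaches the page.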
The rest is pretty basic…
Cool things we would talk about if I had like…3 more hours
• Auto-suggest, Auto-correct
• More on lemmatization and stemming
• Distributed Sphinx Clustering
• Delta indexes
• Real Time Indexes
• The plethora of operators you can use
• Ranged Queries
• ………
Additional Information
• The sphinx documentation is actually pretty great
• http://sphinxsearch.com/docs/
• Slides are already on Slideshare
• Will link them to the meetup shortly
Questions?