Using Sphinx for Search in PHP

This is an intro to Sphinx and PHP. It takes you through the very basics of how Sphinx works, how to set up an index, and how to use the mysql client to search your index. It culminates in a quick little PHP script that builds a small search interface around your index. I will be posting the example code to my GitHub account soon. This presentation was given to the LV PHP meetup on August 5th.

Using Sphinx for Search

Mike Lively Slickdeals, LLC

What is Sphinx?

• A full-text search engine

• Quickly get high quality (relevant) results

• Designed to integrate well with SQL RDBMS

• Can work with any data source

• Can be queried using either an API or SQL

How do I know anything about Sphinx?

• Manager of Software Architecture for Slickdeals.net

• Alexa top 150 site (in the US)

• Have been working at improving our Sphinx search engine for the last 2 months or so.

• Over 7 million searches a month go directly through the interface; many more happen indirectly.

When should I use Sphinx?

• Site / Product / Document searches

• Auto-suggest / Auto-Correct functionality

• Finding relevant and related items

Simple Architecture

• Often, search is offloaded straight to the database

• Search goes to the backend which performs queries on the database

• Obviously very easy to implement

Simple Architecture

• Simple “starts with” searches on indexed fields can sometimes work: `city` LIKE 'Las%'

• Anything else will lock your database for writes with MyISAM.

• MySQL is not a great or flexible full text engine

• It can sometimes be adequate

Sphinx Architecture

• Searchd is responsible for receiving requests from clients and executing the searches against the Sphinx index.

• Indexer is responsible for getting data into the Sphinx index.

• This separation allows indexing and searching to be scaled separately.

Sphinx Architecture

• Searchd has a binary protocol for which there are several clients available in multiple languages.

• Searchd is also compatible with the MySQL binary protocol (the one in use since MySQL 4.1)

• Searchd is a daemon that runs on your search servers

Sphinx Architecture

• Indexer is a shell program that you can execute to build any number of indexes.

• Can handle index rotation for live indexing

Not So Quick Side Note

MySQL IS SLOWWWWWWWWWWWWW

(at text matches)

Still Not Quick Side Note

Indexes won’t help you…

Quicker Side Note

Full Text Search isn’t so bad

IF….

Sphinx Concepts

• Sphinx indexes “documents”

• Each document has a unique unsigned, non-zero integer ID (either 32 bit or 64 bit space)

• Each document has one or more fields

• Each document has zero or more attributes

Indexes / Sources

• Sphinx indexes are created from one or more sources.

• The source can be a database, XML, or TSV stream.

• You can use multiple sources

• This is useful for maintaining updated indexes

• Also used to implement a sphinx cluster

Sphinx Fields

• Fields are what the full-text index is composed of.

• When searching you can search against any number of fields.

• You can assign different relevancy weights to different fields.

• The original value of a field is never stored by Sphinx.

• You should always have at least one.

Sphinx Attributes

• Data that helps further describe the item being indexed

• Can be returned as a part of the search

• Useful for filtering and sorting results

• These are not a part of the full text index.

MySQL Full Text Search

• You can get away with MyISAM tables or, as of MySQL 5.6, InnoDB.

• You don’t care about morphology (think plurals)

• You don’t need anything but the most basic of search operators

Creating An Index

• We are going to add an index that sources a MySQL database.

• The data being sourced is a list of the titles of Wikipedia articles.

Indexer Configuration

• We are going to be peeking into a Sphinx configuration file now.

• You can rebuild the config file by concatenating each section into a single file.

• On my VM this file is located in /usr/local/etc/sphinx.conf

Source Definition

Defines the connection information

Connection information

• Ideally, you should create a separate account for sphinx

• You can also connect via unix socket

• I didn’t specify it here, but you can also add a port.

Source Definition

The query that pulls data to populate the index

Source Index

• The index query MUST return the id field as the first column

• Remember, the id needs to be a unique, unsigned number of 64 bits or fewer

• The query must be on a single line unless you escape newlines with backslashes.

• Notice that we converted the timestamp into a Unix timestamp. That is important.

Source Definition

How data is stored in the index

Source Fields

• The first column in the query is always the ID.

• You specify any columns that are attributes.

• Remember, attributes are stored in the index as values that you can filter and sort by.

• Any column besides the ID that is not declared as an attribute is assumed to be a full-text field (here, title)
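
A source section following the rules above might look roughly like the fragment below. This is a sketch, not the configuration shown in the talk: the source name `wiki_src`, the table `wiki_page`, its columns, the database name, and the credentials are all placeholders. The directive names (`type`, `sql_host`, `sql_user`, `sql_pass`, `sql_db`, `sql_port`, `sql_sock`, `sql_query`, `sql_attr_timestamp`) are standard sphinx.conf options.

```
# Hypothetical source section; every name and value here is a placeholder.
source wiki_src
{
    type     = mysql

    # Connection information, ideally a dedicated account for Sphinx.
    sql_host = localhost
    sql_user = sphinx
    sql_pass = secret
    sql_db   = wikipedia
    # Optional: a TCP port or a unix socket.
    # sql_port = 3306
    # sql_sock = /var/run/mysqld/mysqld.sock

    # The ID comes first. Newlines are escaped with backslashes.
    sql_query = \
        SELECT id, title, UNIX_TIMESTAMP(created_at) AS created_at \
        FROM wiki_page

    # created_at is declared as an attribute; title is not declared,
    # so it becomes a full-text field.
    sql_attr_timestamp = created_at
}
```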

Index Definition

Index Definition

• An index includes one or more sources.

• Each source gets its own “source” line

• Multiple sources must all define the same fields and attributes.

• The ids need to be unique across sources

Index Definition

• path is not actually a path, it’s a filename with no extension.

• docinfo dictates if attributes are stored in the index or outside of the index.

• dict is not really important now. It used to be either crc or keywords, but crc is now deprecated.

• min_word_len is the minimum length of words to index
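
Put together, an index section using the options above could look something like this. It is a sketch under assumptions: the index name `wiki_titles`, the source names, the file location, and the chosen values are placeholders rather than the settings used in the talk.

```
# Hypothetical index section; names and values are placeholders.
index wiki_titles
{
    # One "source" line per source.
    source       = wiki_src
    # source     = wiki_src_other

    # A filename without extension; Sphinx adds its own extensions.
    path         = /var/lib/sphinx/wiki_titles

    docinfo      = extern
    dict         = keywords
    min_word_len = 2
}
```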

Rest of the Index Configuration

It’s time to build the index

indexer <index name>
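
If the index build is driven from a script, the command line can be assembled as in the sketch below. It is written in Python purely for illustration. The helper name `indexer_command` is hypothetical; `--config` and `--rotate` are existing indexer options, and the index name `wiki_titles` is a placeholder.

```python
from typing import List, Optional


def indexer_command(index: str,
                    config: Optional[str] = None,
                    rotate: bool = False) -> List[str]:
    """Build the argument list for one indexer run (hypothetical helper)."""
    argv = ["indexer"]
    if config:
        # Point indexer at a specific configuration file.
        argv += ["--config", config]
    if rotate:
        # Rebuild next to the live index and signal searchd to swap it in.
        argv.append("--rotate")
    argv.append(index)
    return argv


if __name__ == "__main__":
    cmd = indexer_command("wiki_titles", "/usr/local/etc/sphinx.conf", rotate=True)
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if the command is later handed to `subprocess.run`.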

Searching the Index

• searchd is the daemon that searches the index

• Binary protocol

OR

• MySQL compatible too!

searchd config

Included in the same config file as the rest
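
A minimal searchd section might look like the fragment below. The file locations are placeholders and the port numbers are only the conventional ones from the sample configuration, so treat all of them as assumptions. The `mysql41` suffix on a `listen` line is what enables the MySQL-compatible protocol on that port.

```
# Hypothetical searchd section; paths and ports are placeholders.
searchd
{
    # Native binary protocol, used by the API clients.
    listen    = 9312

    # MySQL-compatible protocol, used by the mysql client and by PDO.
    listen    = 9306:mysql41

    log       = /var/log/sphinx/searchd.log
    query_log = /var/log/sphinx/query.log
    pid_file  = /var/run/sphinx/searchd.pid
}
```

With a section like this, the stock client should be able to connect with something like `mysql -h 127.0.0.1 -P 9306`.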

Spinning up searchd

“I know MySQL”

– Sphinx

MySQL Compatible

• Tables == Indexes

• SHOW TABLES shows indexes.

• SELECT * FROM <index> works too.

Selecting from an index

Querying Indexes

• Default limit of 20 rows

• Notice the text fields are not returned…

• They would be if we made them attributes (sql_field_string)

Querying Indexes

• The magic function in SphinxQL is match()

• match() performs a full text search against the entire index…usually

• The ‘@field’ operator can isolate which field is searched on.
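
Because match() takes a query expression rather than plain text, user input usually needs escaping before it is placed after an `@field` operator. The sketch below is written in Python for illustration; the helper names are hypothetical, and the escaped character set is an assumption modelled on what the official API clients escape, so check it against your Sphinx version.

```python
import re

# Characters treated as operators by the extended query syntax.
# Assumption: this is a conservative set, not an authoritative list.
_SPECIAL = re.compile(r'([\\()|\-!@~"&/^$=<])')


def escape_match(text: str) -> str:
    """Backslash-escape query operators in user-supplied text."""
    return _SPECIAL.sub(r"\\\1", text)


def match_field(field: str, text: str) -> str:
    """Build an expression that searches a single field, e.g. '@title foo'."""
    if not field.isidentifier():
        raise ValueError("unexpected field name: %r" % field)
    return "@%s %s" % (field, escape_match(text))


if __name__ == "__main__":
    print(match_field("title", "unit testing"))
    print(escape_match("a-b (draft)"))
```

The resulting string is meant to be passed to MATCH() as a bound parameter, so that the database driver handles the SQL string quoting separately.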

Querying Indexes

• You can query against attributes

• You can sort results

• You can use the weight() function to determine relevancy.
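
The three points above can be combined into one query. The following Python sketch builds such a statement. The index name, the `created_at` attribute, and the helper `build_search` are hypothetical, and the `%s` placeholders assume a driver that substitutes parameters on the client side.

```python
from typing import Optional, Tuple


def build_search(index: str,
                 match_expr: str,
                 min_created: Optional[int] = None,
                 limit: int = 20) -> Tuple[str, tuple]:
    """Return (sql, params) for a relevance-sorted SphinxQL search."""
    if not index.isidentifier():
        raise ValueError("unexpected index name: %r" % index)
    where = ["MATCH(%s)"]
    params = [match_expr]
    if min_created is not None:
        # Filter on an attribute (a Unix timestamp in this sketch).
        where.append("created_at >= %s")
        params.append(int(min_created))
    sql = (
        f"SELECT id, WEIGHT() AS relevance FROM {index} "
        f"WHERE {' AND '.join(where)} "
        f"ORDER BY relevance DESC, id ASC LIMIT {int(limit)}"
    )
    return sql, tuple(params)


if __name__ == "__main__":
    sql, params = build_search("wiki_titles", "@title testing", limit=100)
    print(sql)
    print(params)
```

Sorting by the aliased weight first and the id second keeps the order stable when several documents have the same relevance.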

Querying Indexes

• The 25387283 title was more relevant because it matched on the term “testing”

Getting PHP into the mix

• All we need? PDO.

• We will build a basic search page

• It accepts a query and displays up to 100 matching results, ordered by relevance, with the matching keywords highlighted.

Pulling data from Sphinx

Fetching the data from MySQL
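
Since Sphinx does not return the original text of a field, the usual flow is two steps: ask Sphinx for the matching ids in relevance order, then fetch the rows for those ids from MySQL. The talk's script does this in PHP with PDO; the sketch below shows only the shape of that flow in Python and is not the talk's code. `run_sphinx` and `run_mysql` are hypothetical callables that execute a statement with parameters and return rows, and the index `wiki_titles` and table `wiki_page` are placeholders.

```python
from typing import Callable, List, Sequence, Tuple

Runner = Callable[[str, tuple], Sequence[Sequence]]


def search_titles(run_sphinx: Runner,
                  run_mysql: Runner,
                  query: str,
                  limit: int = 100) -> List[Tuple[int, str]]:
    """Return (id, title) pairs in the relevance order given by Sphinx."""
    # Step 1: Sphinx returns ids only, best match first.
    sphinx_rows = run_sphinx(
        "SELECT id, WEIGHT() AS relevance FROM wiki_titles "
        f"WHERE MATCH(%s) ORDER BY relevance DESC LIMIT {int(limit)}",
        (query,),
    )
    ids = [int(row[0]) for row in sphinx_rows]
    if not ids:
        return []

    # Step 2: MySQL supplies the stored titles for those ids.
    placeholders = ", ".join(["%s"] * len(ids))
    mysql_rows = run_mysql(
        f"SELECT id, title FROM wiki_page WHERE id IN ({placeholders})",
        tuple(ids),
    )
    titles = {int(row[0]): row[1] for row in mysql_rows}

    # IN (...) does not preserve order, so restore the Sphinx ranking here.
    return [(doc_id, titles[doc_id]) for doc_id in ids if doc_id in titles]
```

Ids that Sphinx returns but MySQL no longer has (for example, rows deleted since the last index build) are silently dropped.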

Adding the fancy yellow highlighting
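
One simple way to get the highlighting is to wrap matched words in `<mark>`, which browsers render with a yellow background by default. The Python sketch below does this on the application side and is an assumption about the approach, not the talk's code. It matches only the literal query words, so forms that Sphinx matched through stemming would not be highlighted. Sphinx also provides a `CALL SNIPPETS` statement for server-side excerpts, which is not used here.

```python
import html
import re


def highlight(title: str, query: str) -> str:
    """HTML-escape title and wrap whole-word query matches in <mark>."""
    words = sorted(set(re.findall(r"\w+", query.lower())), key=len, reverse=True)
    if not words:
        return html.escape(title)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(title):
        # Escape the text between matches and the match itself separately,
        # so the inserted tags are the only markup in the output.
        parts.append(html.escape(title[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(title[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(highlight("Unit testing & QA", "testing"))
```

Escaping each piece before joining keeps titles that contain `<` or `&` from breaking the page.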

The rest is pretty basic…

Cool things we would talk about if I had like…3 more hours

• Auto-suggest, Auto-correct

• More on lemmatization and stemming

• Distributed Sphinx Clustering

• Delta indexes

• Real Time Indexes

• The plethora of operators you can use

• Ranged Queries

• ………

Additional Information

• The Sphinx documentation is actually pretty great

• http://sphinxsearch.com/docs/

• Slides are already on Slideshare

• Will link them to the meetup shortly

Questions?
