October 13-16, 2016 • Austin, TX
Simple Fuzzy Name Matching in Solr
Chris Mack
Director, Customer Engineering
Basis Technology

Why Match Names?
Just a Name….
...Right?
1. Security
2. Fraud
3. Commerce

Quick survey: How many of you...
• Regularly develop Solr applications?
• Develop Solr applications that include names of...
  - People?
  - Places?
  - Products?
  - Organizations?
• Have names in languages besides English?

What Makes Name Matching Hard?

Name Variety

Name Variety

Name Ambiguity

How Would You Solve It?

Best Practice: field per variation type?

Idea: Create a Custom Solr Field
• Contribute a score that reflects the variation phenomena.
• Be part of queries using many field types.
• Have multiple fields per document.
• Have multiple values per field.

But what if variations co-occur?
“Jesus Alfonso Lopez Diaz” vs. “LobezDias, Chuy”
1) Reordered.
2) Nickname for first name.
3) Missing second name.
4) Two spelling differences.
5) Missing space.

Can We Do Better?
• Incorporate our proprietary name matching
• Provide similarity scores for name pairs
• Use Solr’s Rerank feature
• Allow higher-precision ranking and thresholding
• Provide multilingual name search

Simple to Configure
• Plugin contains custom field type which does all the work behind the scenes
• Simple addition to schema.xml to include new field type
<fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>
<field name="name" type="rni_name" indexed="true" stored="true" multiValued="false"/>
<field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>

Plug-in Implementation

What happens at query time?
• Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking
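public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
    // Parse the raw query string into a structured name.
    Name name = parseNameString(externalVal, parser.getParams());
    // Build a query specification that selects candidate matches.
    QuerySpec querySpec = buildQuery(name);
    // Translate the specification into a Lucene query on this field.
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}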
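The slides do not show how the keys are derived, so the following is only a rough illustration of the general idea: deriving several coarse keys from a name so that a cheap query can retrieve candidates for later rescoring. The class `CandidateKeys`, the method `keysFor`, and the three key types (exact token, initial, consonant skeleton) are all hypothetical and are not taken from the plug-in.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class CandidateKeys {

    // Hypothetical helper: derive coarse lookup keys from a raw name string.
    // Assumption: ASCII-only handling for brevity; other characters are dropped.
    static Set<String> keysFor(String rawName) {
        Set<String> keys = new LinkedHashSet<>();
        String cleaned = rawName.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z ]", " ")
                .trim();
        if (cleaned.isEmpty()) {
            return keys;
        }
        for (String token : cleaned.split("\\s+")) {
            // Exact token: matches reordered names such as "Diaz, Jesus".
            keys.add("tok:" + token);
            // Initial: matches abbreviated names such as "J. Lopez".
            keys.add("ini:" + token.charAt(0));
            // Consonant skeleton: first letter plus the remaining consonants,
            // which tolerates some vowel-level spelling differences.
            String skeleton = token.charAt(0)
                    + token.substring(1).replaceAll("[aeiouy]", "");
            keys.add("skel:" + skeleton);
        }
        return keys;
    }

    public static void main(String[] args) {
        Set<String> indexed = keysFor("Lopez Diaz, Jesus");
        Set<String> query = keysFor("LOPEZ, J.");
        // Any shared key makes the indexed name a candidate for rescoring.
        Set<String> shared = new LinkedHashSet<>(indexed);
        shared.retainAll(query);
        System.out.println("shared keys: " + shared);
    }
}
```

Keys like these favor recall over precision: they only need to pull plausible candidates into the result set, since the rescoring step decides the final order.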

What else happens at query time?
• Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
  - Tuned for high precision
  - Simple addition to solrconfig.xml
<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>

Plug-in Implementation

Ability to Trade Off Accuracy vs. Speed
• reRankScoreThreshold - Score threshold that a top document must meet to be rescored.
• reRankDocs - Controls how many of the top documents to rescore.
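To make the trade-off concrete, here is a minimal sketch of how two such parameters could bound the cost of rescoring. It is not the plug-in's implementation: the class `RerankSketch`, the `Hit` type, and the rule of placing rescored documents ahead of the rest are assumptions made for illustration. Only the parameter names `reRankDocs` and `reRankScoreThreshold` come from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class RerankSketch {

    // A hit from the first-pass (cheap) candidate query.
    static final class Hit {
        final String name;
        final double score;

        Hit(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    // Rescore only hits that are within the top reRankDocs positions AND whose
    // first-pass score meets reRankScoreThreshold; everything else keeps its
    // original relative order after the rescored hits (an assumed policy).
    static List<Hit> rerank(List<Hit> firstPass, int reRankDocs,
                            double reRankScoreThreshold,
                            ToDoubleFunction<String> expensiveScorer) {
        List<Hit> rescored = new ArrayList<>();
        List<Hit> untouched = new ArrayList<>();
        for (int i = 0; i < firstPass.size(); i++) {
            Hit hit = firstPass.get(i);
            if (i < reRankDocs && hit.score >= reRankScoreThreshold) {
                rescored.add(new Hit(hit.name, expensiveScorer.applyAsDouble(hit.name)));
            } else {
                untouched.add(hit);
            }
        }
        // Highest expensive score first.
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        rescored.addAll(untouched);
        return rescored;
    }

    public static void main(String[] args) {
        List<Hit> firstPass = Arrays.asList(
                new Hit("Lobez Dias", 0.9),
                new Hit("Lopez Diaz", 0.8),
                new Hit("Smith", 0.2),
                new Hit("Lopes", 0.7));
        // Stand-in for an expensive name-similarity function.
        ToDoubleFunction<String> scorer = n -> n.equals("Lopez Diaz") ? 1.0 : 0.4;
        for (Hit h : rerank(firstPass, 3, 0.5, scorer)) {
            System.out.println(h.name + " " + h.score);
        }
    }
}
```

In this sketch, raising `reRankDocs` or lowering `reRankScoreThreshold` sends more documents through the expensive scorer, which can improve ranking quality at the cost of query latency.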

Summary: How it works
• Custom field type
  - Splits a single field into multiple fields covering different phenomena
  - Supports multiple name fields in a document as well as multivalued fields
  - Intercepts the query to inject a custom Lucene query
• Custom rerank function
  - Rescores documents with an algorithm specific to name matching
  - Limits intense calculations to only the top candidates
  - Highly configurable

Suggested Questions:
• What if the names are in unstructured text?
• What if the names are in other text fields?
• How did you implement multi-valued fields?
• How does it scale?
• How do you handle names not in English?
• How does this relate to the theme of Entity-Centric Search?
• How do the plug-in’s scores relate to Solr scores?
• How can I learn more?