DESCRIPTION
This talk explores the challenges of multilingual search, including language-specific issues (N-gram segmentation vs. morphological analysis, stemming vs. lemmatization, and language identification) and the various approaches to configuring your Solr schema. We will also discuss integration strategies for common text analytics capabilities and the impact of multilingual content on application design. Solr is a powerful search engine that has rapidly gained acceptance as an alternative to commercial search solutions for many applications. Organizations need many features to serve their diverse communities; among these is the ability to deliver excellent search in foreign languages. Delivering quality multilingual search involves careful schema design and selection of the best linguistic approach for each supported language.
Basis Technology – Open Source Search Conference 2012 1
Multilingual Search and Text Analytics with Solr
Steve Kearns
Director of Product Management
Basis Technology
Agenda
• Why is Language Important?
• Approaches for language-aware search
• Solr Configuration Options
Language is Important
Why is language important?
• Content is produced and consumed in the native language
• Document collections often contain more than one language
• Each language is unique, and presents different challenges to the search engine
Language is Complex
• Tokenization
  • Some languages do not use spaces
  • Compound words combine two or more words
  • Conjunctions
• Inflection
  • In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.
http://en.wikipedia.org/wiki/Inflection
Language is Complex
http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png
Language is Complex!
• The Spanish word “pasaportar” has more than 50 inflected forms:
pasaportando pasaportes pasaportada pasaportaba pasaportarían pasaportarais pasaportasen pasaportaren pasaportado pasaportaremos pasaportábamos pasaportases pasaportaríais pasaportaran pasaportarías pasaportaras pasaportarás
pasaportareis pasaportaron pasaportase pasaportemos pasaportaría pasaportara pasaportasteis pasaportáramos pasaportaban pasaportásemos pasaportamos pasaporten pasaportaréis pasaportabas pasaportaríamos pasaportáremos pasaporto
pasaportarán pasaporte pasaportan pasaporta pasaportaste pasaportad pasaportéis pasaportadas pasaporté pasaportados pasaportaré pasaportare pasaportará pasaportó pasaportabais pasaportaseis …
http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar
Language Examples
• English:
  spoke (Noun – wheel part) → spoke
  spoke (Verb, past tense) → speak
• French:
  été (summer) → été (summer)
  été (was) → être (to be)
• German:
  Robbe (seal) → Robbe (seal)
  robbe (I crawl) → robben (to crawl)
  Samstagmorgen (Saturday Morning) → Samstag, Morgen (compound)
• Japanese:
  • 首脳会談後、オバマ大統領は記者団の質問に答える予定
    – Where are the words??
Language-Aware Search Technology
• Rosette Linguistic Platform
  • Language Identification
  • Tokenization
    » Morphological
  • Token processing
    » Lemmatization
  • Higher level analytics
    » Entity Extraction
    » Relationship Extraction
• Entity Translation and Entity Search
Language Identification
• Find a single dominant language in a document
• Find multiple languages in a single document
Tokenization
• Morphological Analysis vs. N-gram
• Search Term: 東京 ルパン上映時間
• N-gram:
• Morphological Analysis:
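The N-gram side of this comparison can be illustrated with a minimal Python sketch. It assumes plain overlapping character bigrams within each whitespace-separated chunk; the function name `char_ngrams` is made up for illustration and this is not the tokenizer used in the talk. Morphological analysis is not shown because it needs a dictionary-backed analyzer.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk.

    Illustrative only: a real N-gram tokenizer also handles punctuation,
    mixed scripts, and token positions.
    """
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            # Chunk is too short to slide over; keep it as a single term.
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


terms = char_ngrams("東京 ルパン上映時間")
print(terms)
```

Bigrams need no dictionary, but they also produce terms that straddle word boundaries (for example the bigram spanning ルパン and 上映), which is the kind of noise a morphological analyzer avoids.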
Token Processing
• Stemming vs. Lemmatization
• English: “I have spoken at several conferences”
• Stemming:
• Lemmatization:
Stemming vs. Lemmatization
• Two words with the same spelling, but different meanings create the same stem.
Stemming: INCORRECT
  prensa (media) → prens
  prensa (he/she presses) → prens
Lemmatization: CORRECT
  prensa (media) → prensa (media)
  prensa (he/she presses) → prensar (to press)
Stemming vs. Lemmatization
• Two different words create the same stem.
Stemming: INCORRECT
  publicaciones (publications) → public
  publico (public) → public
Lemmatization: CORRECT
  publicaciones (publications) → publicación
  publico (public) → public (public)
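The two failure modes above can be reproduced with a toy Python sketch. The suffix list, the part-of-speech tags, and the `LEMMAS` table are all hand-written assumptions for this illustration; they are not a real Spanish stemmer or the lemmatizer discussed in the talk.

```python
# Toy suffix list, checked in order (longest first). Hypothetical, for illustration.
SUFFIXES = ("aciones", "ación", "ar", "a", "o")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# Hand-written (word, part of speech) -> lemma table, also hypothetical.
LEMMAS = {
    ("prensa", "NOUN"): "prensa",
    ("prensa", "VERB"): "prensar",
    ("publicaciones", "NOUN"): "publicación",
}


def lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


# Stemming ignores meaning and part of speech, so distinct words collide.
print(naive_stem("prensa"), naive_stem("publicaciones"), naive_stem("publico"))
# Lemmatization keeps the noun and the verb reading of "prensa" apart.
print(lemmatize("prensa", "NOUN"), lemmatize("prensa", "VERB"))
```

The lemmatizer only separates the two readings of "prensa" because the caller supplies a part of speech; in practice that information comes from analyzing the word in context.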
Token Processing
German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
• Stemming:
• Lemmatization (and decompounding!):
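Decompounding can be sketched as a dictionary-driven split. The following minimal Python example assumes a tiny hand-written lexicon and a longest-match-first strategy; both are illustrative assumptions, and it ignores German linking elements (such as the connecting "s") that a real decompounder must handle.

```python
# Hypothetical two-word lexicon; a real system uses a full dictionary.
LEXICON = {"samstag", "morgen"}


def decompound(word, lexicon=LEXICON, min_len=3):
    """Split a compound into known parts, longest head first.

    Returns [word] (lowercased) when no complete split is found.
    """
    def split(rest):
        if not rest:
            return []
        # Try the longest candidate head first, then shorter ones.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no way to cover the remainder with lexicon words

    w = word.lower()
    parts = split(w)
    return parts if parts else [w]


print(decompound("Samstagmorgen"))
print(decompound("fliege"))
```

Indexing the parts alongside the full compound lets a query for "Morgen" match a document that only contains "Samstagmorgen".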
How to Configure Solr
• Challenges
  • Multiple languages in the data set
• Goals:
  1. Language Identification
  2. Language-aware Search:
     • Tokenization
     • Token Processing
How to Configure Solr
• What tools does Solr have to work with?
  • UpdateRequestProcessor
  • Analyzer/CharFilter/Tokenizer/TokenFilter
  • Solr Cores
• Pre-process data before Solr?
Solr UpdateRequestProcessor
• Runs Before Analyzers
• Full Access to Document
• Two options:
  • Run the analysis directly in Solr
    » Good for Lightweight Analysis
  • Call out to external analysis services
    » Web Services/UIMA. Increases Complexity
• Limitations:
  • Think through your indexing strategy
Solr Analyzer/Tokenizer
• Good for:
  • Segmentation of Asian Languages
  • Linguistics - Lemmatization
• Limitations:
  • No access to document object
• Schema.xml
  • FieldType
    » Analyzer
      – CharFilter
      – Tokenizer
      – TokenFilter
Goal 1: Language ID
• UpdateRequestProcessor
  • Runs before field-level analysis takes place
  • Basic Language Identifier URP to be included in Solr
• Outside Solr
What do you do with the language information??
Goal 2: Multi-Lingual Support in Solr
• Three main approaches:
1. One Solr field for each language
2. One Solr Core per language
3. All Languages in a Single Field
Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf
Multiple Languages: Method 1
• One field for each language
• Pro:
  • Simple approach and implementation
  • Guarantees that queries are processed the same way as index
• Con:
  • Increased query-time complexity (mitigate with Dismax)
  • Decreased query speed as additional fields are queried
  • May require storing multiple copies of text
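A minimal Python sketch of Method 1 follows, under stated assumptions: the field names (`text_en`, `text_fr`, ...), the `language` field, and the supported-language list are hypothetical and would have to match your own schema.xml. `defType` and `qf` are standard Solr request parameters for the DisMax query parser.

```python
# Hypothetical set of languages with a dedicated field in schema.xml.
SUPPORTED = ("en", "fr", "de", "ja")


def field_for(lang, default="en"):
    """Map a language code to its per-language text field (assumed naming)."""
    return "text_" + (lang if lang in SUPPORTED else default)


def to_solr_doc(doc_id, text, lang):
    """Build the document to index, routing text by detected language."""
    return {"id": doc_id, "language": lang, field_for(lang): text}


def dismax_params(user_query, langs=SUPPORTED):
    """Build request parameters that search every language field at once."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(field_for(lang) for lang in langs),
    }


print(to_solr_doc("1", "Am Samstagmorgen fliege ich", "de"))
print(dismax_params("Samstagmorgen"))
```

Because each field has its own analyzer, the query text is analyzed per field in the same way as the indexed text; the cost is that every additional language adds one more field to `qf`.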
Multiple Languages: Method 2
• One Solr core per language
  Each Core has the same field, with a language-specific Analyzer/Tokenizer
• Pros:
  • No query-time performance overhead
  • Guarantees that queries are processed the same way as index
• Cons:
  • Significant complexity in managing multiple cores
  • Must implement custom sharding
  • Does not support multilingual documents
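Routing for Method 2 might look like the following Python sketch. The base URL and the core names (`docs_en`, `docs_de`, `docs_ja`) are assumptions for illustration; only the `/solr/<core>/select` path layout reflects how Solr addresses a core.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr"  # assumed local Solr instance
# Hypothetical mapping from language code to core name.
CORES = {"en": "docs_en", "de": "docs_de", "ja": "docs_ja"}


def select_url(lang, query):
    """Build the select URL for the core that holds documents in `lang`."""
    core = CORES[lang]  # raises KeyError for an unsupported language
    return f"{BASE_URL}/{core}/select?" + urlencode({"q": query})


def select_urls_all(query):
    """Build one select URL per core, for queries of unknown language."""
    return [select_url(lang, query) for lang in sorted(CORES)]


print(select_url("de", "Robbe"))
```

A query of unknown language has to be sent to every core and the results merged by the application, which is where the management and sharding complexity listed above comes from.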
Multiple Languages: Method 3
• All Languages in one field
• Pros:
  • Single field makes queries and indexing easy
  • Same schema/core as more languages added
• Cons:
  • Requires complex custom Tokenizer/Analyzer
  • Must pass in language information for queries and indexing
  • Does not guarantee queries are processed the same as the index
  • Potential TF/IDF confusion
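The dispatching idea behind Method 3 can be sketched in Python as one entry point that picks an analysis routine from a language code supplied by the caller. The analyzers below are deliberately crude stand-ins (whitespace splitting and character bigrams), and all names are hypothetical; a real implementation would be a custom Solr Tokenizer/Analyzer.

```python
def analyze_whitespace(text):
    """Stand-in analyzer for space-delimited languages."""
    return text.lower().split()


def analyze_bigram(text):
    """Stand-in analyzer for languages written without spaces."""
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)] or [compact]


# Hypothetical language -> analyzer table for the single shared field.
ANALYZERS = {"en": analyze_whitespace, "de": analyze_whitespace, "ja": analyze_bigram}


def analyze(text, lang):
    """Analyze text for the shared field, using the caller-supplied language."""
    return ANALYZERS.get(lang, analyze_whitespace)(text)


indexed = analyze("上映時間", "ja")   # language known at index time
queried = analyze("上映時間", "en")   # wrong language passed at query time
print(indexed, queried, set(indexed) & set(queried))
```

When the language passed at query time differs from the one used at index time, the two term sets share nothing and the query cannot match, which illustrates why this method does not guarantee that queries are processed the same way as the index.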
Language is Important
• Use language information at index and query time
• Increase recall, maintain precision
• Better search results for your users
My Contact Info
• Steve Kearns
• [email protected]
• http://www.basistech.com