TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems

Phu Le, Knorex @ Grokking TechTalk

25 June 2016

Web: http://knorex.com
Email: info@knorex.com

Knorex Lumina Web Services™

Outline

1. Architecture
2. Ingredients
  • Data gathering
  • Content extraction
  • Preprocessing
  • Modelling: terms -> phrases, entities -> documents
3. Elasticsearch
  • Basic analysis, faceting and filtering
  • Do you mean
  • Percolator
  • Recommendation
  • Deduplication
4. Summary

Architecture

Ingredients

1. Data gathering
  • Deep crawler
  • Lazy crawler
  • Visual scraper
  • Social media adapters

2. Content extraction
  • Take a news article as an example
    • Title
    • Content
    • Published date
    • Author
    • Image
    • …

Content extraction

Content extraction

Ingredients

3. Preprocessing
  • Sentence splitting, tokenization
  • Stemming vs lemmatizing
    • Stemming: cries, crying, cried => cri
    • Lemmatizing: dogs => dog; is, are => be
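The difference between the two can be sketched in a few lines of Python. This is a toy illustration only: the suffix rules and the lemma dictionary below are hypothetical tables chosen to reproduce the slide's examples, not the rules of any real stemmer or lemmatizer.

```python
# Toy illustration only: real stemmers (e.g. Porter) and lemmatizers use far
# richer rules and dictionaries than the hypothetical tables below.

# Ordered suffix rules: (suffix to strip, replacement). Assumed for this sketch.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ied", "i"),
    ("ying", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

# Tiny hand-written lemma dictionary (assumption).
LEMMAS = {"dogs": "dog", "is": "be", "are": "be"}


def toy_stem(word):
    """Chop a known suffix off the word; the result need not be a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters of the stem so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary and return its base form."""
    word = word.lower()
    return LEMMAS.get(word, word)


stems = [toy_stem(w) for w in ("cries", "crying", "cried")]
lemmas = [toy_lemmatize(w) for w in ("dogs", "is", "are")]
print(stems)
print(lemmas)
```

Stemming is rule-based and cheap but can produce non-words such as "cri"; lemmatizing needs a vocabulary but returns real base forms.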

Ingredients

4. Modelling
  • Goal: synthesize words and tokens into larger units and attach meaning to them
  • Key phrase extraction
  • Named entity recognition
    • Basic building block of knowledge
    • Basis for computing relatedness and extracting relations
  • Sentiment analysis
    • Social media snippet
    • General article or towards concepts / named entities
  • Emotion
  • Document classification
    • Group search results into faceted categories
    • Recommend related articles by category

Phrases

Entities

Document classification

• First released Feb 2010, among the fastest-growing open-source projects, total funding $104M (3 rounds)
• Based on Apache Lucene (same as Solr)
• Written in Java, supports an HTTP interface and schema-free JSON documents (yay no XML!)
• Designed to be scalable, distributed in nature

Analysis

"analyzer": "standard"
"analyzer": "whitespace"
"analyzer": "keyword"

Analysis

Successful!

Default analyzer: ["https", "www.facebook.com", "events", "194454270949757"]

No hits! WTH… it is not working!!!!

• url => not_analyzed / keyword analyzer
• Use match query instead of term filter / term query: field analyzer awareness
• Custom analyzer: e.g. keyword tokenizer + lowercase filter
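Why the term query finds nothing can be shown with a small emulation in plain Python. This is a rough sketch, not Elasticsearch code: `standard_like_tokens` is a hypothetical regex approximation of the default analyzer, and the URL is an assumed reconstruction from the token list on the slide.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the default analyzer (assumption): lowercase,
    split on punctuation, keep dots that sit between letters or digits."""
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def keyword_tokens(text):
    """keyword / not_analyzed behaviour: the whole value is a single token."""
    return [text]


def term_query(indexed_tokens, value):
    """A term query is not analyzed: it looks for the exact value as a token."""
    return value in indexed_tokens


def match_query(indexed_tokens, value, analyzer):
    """A match query first runs the field's analyzer on the input, then
    looks up the resulting tokens (any token may match, as with OR)."""
    return any(tok in indexed_tokens for tok in analyzer(value))


# Assumed reconstruction of the URL behind the slide's token list.
url = "https://www.facebook.com/events/194454270949757"

analyzed_index = standard_like_tokens(url)
keyword_index = keyword_tokens(url)

print(analyzed_index)
print(term_query(analyzed_index, url))                          # exact URL is not a token
print(match_query(analyzed_index, url, standard_like_tokens))   # analyzer-aware lookup
print(term_query(keyword_index, url))                           # works on a keyword field
```

The index only contains the pieces of the URL, so an exact-value lookup misses; either analyze the query the same way (match) or index the field as a single token (keyword).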

Analysis

[Diagram labels: Index -> index analyzer -> Elasticsearch index; Search -> search analyzer -> Elasticsearch index]

• Decide carefully which fields will be searched most frequently
• Determine which analyzers to use for each field (experiment based on application needs)
• Search analyzer and index analyzer might be different for the same field
• Use a match query instead of a term filter / term query: match is aware of the field analyzer
• Exploit multi-fields
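As a sketch of how these points could fit together, here is an index definition written as a Python dict. It uses the mapping syntax of recent Elasticsearch releases (`text` / `keyword` types), which is an assumption: versions current at the time of the talk expressed the same idea with `string` fields and `"index": "not_analyzed"`. All analyzer, tokenizer and field names are hypothetical.

```python
import json

# Hypothetical index definition; names such as "lowercase_keyword",
# "edge_tokens" and "autocomplete_index" are invented for this sketch.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tokens": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Custom analyzer: keyword tokenizer + lowercase filter.
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
                "autocomplete_index": {
                    "type": "custom",
                    "tokenizer": "edge_tokens",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            # The whole URL is kept as one lowercased token.
            "url": {"type": "text", "analyzer": "lowercase_keyword"},
            "title": {
                "type": "text",
                "analyzer": "standard",
                # Multi-field: the same value indexed in several ways.
                "fields": {
                    "raw": {"type": "keyword"},
                    "autocomplete": {
                        "type": "text",
                        # Different analyzers at index time and search time.
                        "analyzer": "autocomplete_index",
                        "search_analyzer": "standard",
                    },
                },
            },
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The `title.autocomplete` sub-field shows why index and search analyzers may differ: prefixes are generated once at index time, while the query text is only split into words.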

Faceting and filtering

Do you mean
• “grok” -> “grokking”, “sear” -> “search”
• Natural approach:
  • Compute terms aggregation (facet) across all text fields
    • title
    • description
    • content
  • Use regex to filter matched terms, sort DESC by frequency, take most popular terms to suggest

DON’T!!!

Do you mean
• Limitations
  • Single terms only. Cannot suggest phrases
  • Terms occurring frequently might not be useful
• Improvements
  • Building another field “phrases” in the document
    • adding entire title
    • Using key phrases extraction, named entity recognition to populate meaningful phrases
  • Custom tokenizers: keyword, edgeNGram
    • edgeNGram example: “grokking” => “gro”, “grok”, “grokk”
    • Query: “burs mal” => matched: “bursa malaysia”
    • memory explosion!!!
  • Custom scoring (importance, popularity score) instead of term frequency
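The edgeNGram idea, and why it inflates the index, can be sketched in plain Python. The `min_gram` / `max_gram` values and the helper names are assumptions for illustration; this emulates the behaviour rather than calling Elasticsearch.

```python
def edge_ngrams(token, min_gram=3, max_gram=10):
    """All prefixes of the token between min_gram and max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_phrase(phrase, min_gram=3, max_gram=10):
    """Index-time analysis: every word is expanded into its edge n-grams."""
    grams = set()
    for word in phrase.lower().split():
        grams.update(edge_ngrams(word, min_gram, max_gram))
    return grams


def matches(query, phrase):
    """Search-time analysis only splits into words; every query word must
    be one of the indexed prefixes."""
    indexed = index_phrase(phrase)
    return all(tok in indexed for tok in query.lower().split())


print(edge_ngrams("grokking"))
print(matches("burs mal", "bursa malaysia"))
print(matches("sia", "bursa malaysia"))   # not a prefix of any word
print(len(index_phrase("bursa malaysia")))
```

Two words already turn into nine indexed terms here, which is where the memory growth noted on the slide comes from.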

Do you mean
• Elasticsearch built-in suggester
  • FST example. Source: https://www.elastic.co/blog/you-complete-me
  • Features:
    • Speed & scale: FST per-segment, build in real-time, scale horizontally
    • Analysis: synonym, fuzzy
    • Support custom ordering and scoring
  • Limitations: can’t find word anywhere within a phrase
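The behaviour of a prefix-based suggester, including custom ordering and the prefix-only limitation, can be illustrated with a much simpler structure. The sketch below swaps the FST for a sorted list searched with `bisect`, which is not how Elasticsearch implements it; the class name, phrases and weights are hypothetical.

```python
import bisect


class PrefixSuggester:
    """Sorted-list stand-in for an FST-backed suggester (illustration only)."""

    def __init__(self, weighted_phrases):
        # weighted_phrases: dict mapping phrase -> weight (e.g. popularity).
        self._weights = {p.lower(): w for p, w in weighted_phrases.items()}
        self._sorted = sorted(self._weights)

    def suggest(self, prefix, size=5):
        prefix = prefix.lower()
        # All phrases sharing the prefix are contiguous in sorted order.
        start = bisect.bisect_left(self._sorted, prefix)
        hits = []
        for phrase in self._sorted[start:]:
            if not phrase.startswith(prefix):
                break
            hits.append(phrase)
        # Custom ordering: highest weight first, ties broken alphabetically.
        hits.sort(key=lambda p: (-self._weights[p], p))
        return hits[:size]


suggester = PrefixSuggester(
    {"bursa malaysia": 10, "business news": 3, "burst": 1, "kuala lumpur": 5}
)
print(suggester.suggest("bu"))
print(suggester.suggest("malaysia"))  # word in the middle of a phrase: no hit
```

Indexing each phrase under several inputs (for example, one entry per word) is a common way around the prefix-only limitation, at the cost of a larger structure.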

Do you mean
• Speed test: 1 million articles, 2.7 GB index size on a single laptop with SSD
  • Regex terms facet: 296.5 ms
  • Terms suggester: 13 ms
• Cautions
  • Don’t add all terms/phrases to suggestion (only meaningful ones!)
  • Don’t start suggesting immediately. How many words start with “c”?
  • Don’t suggest terms that yield no search results
    • Apply same filter condition of current query to the term suggestion query

Percolator
• percolate: match documents against queries

Percolator
• Sample use case: segmenting articles using keywords
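Percolation inverts the usual search: the queries are stored, and each incoming document is checked against them. A minimal sketch of the keyword-segmentation use case in plain Python follows; the segment names and keyword sets are hypothetical, and a real percolator would store full Elasticsearch queries rather than keyword sets.

```python
import re

# Registered "queries": segment name -> keywords (hypothetical examples).
SEGMENT_QUERIES = {
    "finance": {"stock", "bursa", "dividend"},
    "technology": {"elasticsearch", "nlp", "search"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def percolate(document_text, queries=SEGMENT_QUERIES):
    """Return the names of all stored queries that the document matches."""
    tokens = tokenize(document_text)
    return sorted(name for name, keywords in queries.items() if tokens & keywords)


print(percolate("Marrying Elasticsearch with NLP to solve real-world search problems"))
print(percolate("Bursa Malaysia stock search volume rises"))
print(percolate("Weather today"))
```

Each new article is tagged as it arrives, without re-running every segment query over the whole index.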

Recommendation
• Natural approach
  • More-like-this or fuzzy-like-this on title, content
  • Not accurate, bag-of-word approach.
  • Tricky in determining threshold. “Good value” varies across different document types and domains
  • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, then accuracy drops
• Proposed approaches
  • Utilize NLP results (modelling step):
    • Category: recommend articles from same categories
    • Key phrases: match and rank documents w.r.t target documents by key phrases
    • Named entities: model with parent/child relationship
  • Combine with function score feature to rescore results
    • Example: applying a Gauss decay function to favor more recent results
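The effect of a Gauss decay on ranking can be sketched outside Elasticsearch. The function below follows the Gaussian decay formula given in the Elasticsearch function score documentation; the article data, the 7-day scale and the choice to multiply relevance by the decay value are assumptions for illustration.

```python
import math


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 at the origin, `decay` at distance `scale`
    (beyond the offset), falling towards 0 further out."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be between 0 and 1 (exclusive)")
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


# Hypothetical candidates: (article id, relevance score, age in days).
candidates = [("old-but-relevant", 2.0, 30), ("fresh", 1.5, 1)]

# Rescore by multiplying relevance with a recency decay (scale = 7 days).
rescored = sorted(
    ((doc_id, score * gauss_decay(age, origin=0, scale=7)) for doc_id, score, age in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, score in rescored:
    print(doc_id, round(score, 4))
```

With these numbers the month-old article drops below the fresh one even though its raw relevance is higher; `scale` controls how quickly that happens.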

Recommendation

• Sophisticated scoring and ranking can be done outside of Elasticsearch
• Still, can tap on Elasticsearch for faceting and filtering capability

Deduplication
• Natural approach
  • Term matching on URL, title
    • Fails if these are slightly different (very common!)
  • More-like-this or fuzzy-like-this on content, with high matching threshold: e.g. 70%, 80%
    • Not accurate, bag-of-word approach.
    • Tricky in determining threshold. “Good value” varies across different document types and domains
    • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, then accuracy drops
• Proposed approach
  • Semantic hashing: minhash, simhash
    • for a document, compute a hash value
    • convert the hash value to binary string form
    • robust and efficient, can cater to near-duplicates
  • Implement Hamming distance search using Elasticsearch fuzzy_like_this
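A minimal simhash sketch in Python shows the three steps: hash the document, render the hash as a binary string, compare by Hamming distance. The tokenization, the unweighted tokens and the use of MD5 as the per-token hash are assumptions for this illustration, not the speaker's implementation.

```python
import hashlib
import re

BITS = 64


def _hash64(token):
    """64-bit hash of a token (first 8 bytes of its MD5 digest; assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    """Each token votes +1 or -1 on every bit position; the fingerprint
    keeps the bits whose votes end up positive."""
    votes = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = _hash64(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def to_bits(fingerprint):
    """Binary string form of the fingerprint, zero-padded to 64 characters."""
    return format(fingerprint, "064b")


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc = "Marrying Elasticsearch with NLP to solve real-world search problems"
near_dup = "Marrying Elasticsearch with NLP to solve real world search problems!"
print(to_bits(simhash(doc)))
print(hamming_distance(simhash(doc), simhash(near_dup)))
```

Because similar token sets cast similar votes, near-duplicates tend to land within a few bits of each other, so a small Hamming-distance threshold can replace a similarity threshold on raw text.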

Deduplication

• Do not index duplicates at all
or
• Collapse similar items in search results, display only the one with highest score
  • Assign same id for articles that are duplicates (call it groupid)
  • Use Elasticsearch Top Hits aggregation to collapse results by groupid

=> 64-bit hash: 1000010001000111101001011011110010111101000011100101101001011101
Modified version: 1010010001000111101011011011110010111101000011100101101000011101
Hamming distance: 3
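The Hamming distance of the two bit strings above can be checked directly, and the collapse step sketched in a few lines. The hit records and scores are hypothetical, and the grouping is done in plain Python rather than with an Elasticsearch aggregation.

```python
# The two 64-bit strings from the slide, split the same way as in the source.
original = (
    "100001000100011110100101101111001011110100001"
    "1100101101001011101"
)
modified = (
    "101001000100011110101101101111001011110100001"
    "1100101101000011101"
)


def hamming_distance_bits(a, b):
    """Count differing positions between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse_by_group(hits):
    """Keep only the highest-scoring hit per groupid, best score first."""
    best = {}
    for hit in hits:
        gid = hit["groupid"]
        if gid not in best or hit["score"] > best[gid]["score"]:
            best[gid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


print(hamming_distance_bits(original, modified))

# Hypothetical search hits; duplicates share a groupid.
hits = [
    {"id": "a1", "groupid": "g1", "score": 3.2},
    {"id": "a2", "groupid": "g1", "score": 4.1},
    {"id": "b1", "groupid": "g2", "score": 2.7},
]
print([h["id"] for h in collapse_by_group(hits)])
```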

Further reading
• Dismax vs bool queries
• Term vs text queries
• Filter vs filtered
• Facets (old) vs aggregations (facets reborn + statistics)
• Geo

Summary
• ES is very flexible with numerous features and knobs
• Critical to understand basic analysis, different types of queries
• Indexing time and search time tradeoff
• Precision and recall tradeoff
• Complexity and memory estimation
• Use NLP techniques as modelling step to improve search quality
• Pay great attention to data input and data gathering step

About Knorex

Founded in 2010 as spin-off from Data Mining Dept. of A*STAR, Singapore

Mission

Enabling our customers to make smarter discovery and turn it into actionable insight

https://www.knorex.com

https://itviec.com/companies/knorex

Thank you
