22
Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John Herzer Text Analytics World 2016, Chicago Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Enhancing search results relevance with machine learning

Pengchu Zhang John Herzer

Text Analytics World 2016, Chicago

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

The Holy Grail: Conceptual search

u  Go beyond keyword search to actually search on the concept behind the customer’s query

u  Problems due to the imprecise nature of language •  Synonymy

–  Query is for “dog” but desired document contains “canine” •  Polysemy

–  Does query for “lead” refer to “leading a team” or the chemical element “Pb”

•  Stemming –  Searching for “strike” vs “striking” vs “struck”

u  How do we implement conceptual search?

Page 3: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Acquiring a corporate dictionary

u  Commercial / open source dictionaries don’t have words unique to our organization

u  Efforts to build a corporate ontology/taxonomy/dictionary tend to fizzle

u  What we need is a way to build the dictionary from our own corpus in an automated way

u  Word2Vec is an unsupervised machine learning approach that lets us identify related words from the corpus

Page 4: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

To Improve Search Results with Word2Vec u  Query with a single term or phrase

•  “retirement” Search engine will return documents that contain the term and rank the documents based on the frequency of “retirement”;

u  Word2Vec expands the query term into a set of RELATED terms or phrases

“retirement Pension Savings Eligible pension_fund income_plan Pension_plans”

Search engine will return documents that contain all or some of the terms or phrases and rank the documents based on the frequencies of the set of terms or phrases, the set of terms/phrases represents as “Concept”

u  How to expand a set of RELATED terms from a single query term?

Page 5: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

We

used

modeling

climate

change

INPUT layer Hidden Layer OUTPUT Layer

Term P Computer 0.9 Algorithm 0.8 Software 0.8 Brain 0.7 iPhone 0.4 Ghost 0.3 Dog 0.2

Concept of Neural Network Language Model

Page 6: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Problems of Word Representation in Traditional Language Models

u  One-Hot Representations •  Simple way to encode discrete concepts, such as words

–  “we used computer modeling climate change” –  We = [1 0 0 0 0 0] –  used = [0 1 0 0 0 0] –  computer = [0 0 1 0 0 0] –  modeling = [0 0 0 1 0 0] –  climate = [0 0 0 0 1 0] –  change = [0 0 0 0 0 1]

•  A one-hot encoding makes no assumption about word similarity –  All words are equally different from each other

•  This representation is very high in dimensions –  The dimensionality is the size of the vocabulary –  A typical vocabulary size is 100,000

Page 7: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Word2Vec in Natural Language Applications

Lecun, Y, Bengio, Y and Hinton G. Deep Learning. Nature 521, 436-444 (28 May 2015)

Collobert, R., et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011). Socher, R., Lin, C. C-Y., Manning, C. & Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Proc. International Conference on Machine Learning 129–136 (2011). Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).

Sutskever, I. Vinyals, O. & Le. Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014).

Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014). Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).

Page 8: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

The dog (cat) is walking in the room

(0.12, 0.23, 0.22) (0.32, 0.27, 0.94) (0.18, 0.88, 0.45) (0.23, 0.92, 0.23) (0.77, 0.25, 0.11) (0.12, 0.23, 0.22) (0.41, 0.13, 0.29)

Initial vector

(0.12, 0.23, 0.22) (0.62, 0.99, 0.14) (0.18, 0.88, 0.45) (0.23, 0.92, 0.23) (0.77, 0.25, 0.11) (0.12, 0.23, 0.22) (0.41, 0.13, 0.29)

New vector

Representing Words as Vectors and Adjusting the Vector based on How Words are Used in Writing and Talking

Word2Vec (Google 2013)

Page 9: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Two Approaches to Build Word2Vec Models

Page 10: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Conceptual Vector Space distribution of terms by Word2Vec

pzt

Ceramic ferroelectrics glasses

Lead_zirconate management

reports disposal

radioactive wastes hlw acceptability

uranium msword

hpc

supercomputing

petascale networking distributed

CERAMICS

NUCLEAR WASTE

CLOUD COMPUTING Pension 0.525072 Savings 0.495344 Eligible 0.469864 pension_fund 0.46272 income_plan 0.449314 pension_plans 0.435645

retirement Vector distance

Related terms

corpus

Vector Space

Page 11: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Example of Term Expansion

Page 12: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Enhancing queries with Word2Vec

Solr Index

SearchPoint Web service

“science”

+((science)^1.5 (engineering sciences research)^1.2)

Word2vec model

1

2

3

4

5

6

Page 13: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Handling synonyms with Word2Vec

dog

Word2Vec

or leash or vet

leash

canine

Page 14: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Handling homonyms with Word2Vec

lead

Word2Vec

or team or project

lead

project

lead

contaminant

Page 15: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Results without query expansion

Page 16: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Results with query expansion

Page 17: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Acronyms without query expansion

Page 18: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Acronyms with query expansion

Page 19: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Evaluating the results of term expansion

Page 20: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Integrating Word2Vec with the SearchPoint App

cache

Solr

Oracle

Word2Vec model builder

Oracle

WebLogic App Server

Model Loader Solr client BLOBs

models

content

query

results

Enhanced query

Page 21: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Performance considerations

u  SearchPoint is a J2EE app deployed to Weblogic

u  Model File sizes •  solrTerms.ser 479Mb •  solrPhrases.ser 825 Mb

u  Models loaded during Servlet init •  69 seconds to load terms •  Java memory usage: -Xms256m –Xmx2g

u  Time to access terms for query: < 150ms

Page 22: Enhancing search results relevance with machine learning · 2016-06-20 · Sandia National Laboratories Enhancing search results relevance with machine learning Pengchu Zhang John

Sandia National Laboratories

Future work

u  Tackle the query disambiguation problem (polysemy) •  Predict most likely meaning of query term based

on corpus usage •  Predict personalized meaning from customer’s

prior usage u  Identify phrases within the query

•  Can eliminate many non-relevant results •  Given an expertise query “server admin java”

“server admin” or java will provide better results than server or admin or java