Sandia National Laboratories
Enhancing search results relevance with machine learning
Pengchu Zhang, John Herzer
Text Analytics World 2016, Chicago
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
The Holy Grail: Conceptual search
• Go beyond keyword search to actually search on the concept behind the customer’s query
• Problems due to the imprecise nature of language
  – Synonymy: query is for “dog” but the desired document contains “canine”
  – Polysemy: does a query for “lead” refer to “leading a team” or to the chemical element “Pb”?
  – Stemming: searching for “strike” vs “striking” vs “struck”
• How do we implement conceptual search?
Acquiring a corporate dictionary
• Commercial / open source dictionaries don’t have words unique to our organization
• Efforts to build a corporate ontology/taxonomy/dictionary tend to fizzle
• What we need is a way to build the dictionary from our own corpus in an automated way
• Word2Vec is an unsupervised machine learning approach that lets us identify related words from the corpus
To Improve Search Results with Word2Vec
• Query with a single term or phrase
  – “retirement”
  – The search engine returns documents that contain the term and ranks them by the frequency of “retirement”
• Word2Vec expands the query term into a set of RELATED terms or phrases
  – “retirement Pension Savings Eligible pension_fund income_plan Pension_plans”
  – The search engine returns documents that contain all or some of the terms or phrases and ranks them by the frequencies of the whole set; the set of terms/phrases represents a “Concept”
• How do we expand a single query term into a set of RELATED terms?
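The difference between ranking on a single term and ranking on a "Concept" can be sketched in a few lines of Python. This is only an illustration: the three documents are invented, the scoring is a plain term count rather than whatever the search engine actually computes, and the names `score` and `rank` are hypothetical.

```python
from collections import Counter

def score(document, terms):
    """Count how often any of the given terms occurs in the document."""
    counts = Counter(document.lower().split())
    return sum(counts[t.lower()] for t in terms)

# Hypothetical mini-corpus, invented for illustration only.
docs = {
    "doc_a": "retirement retirement party photos",
    "doc_b": "pension savings and retirement income_plan overview",
    "doc_c": "eligible employees may join the pension_fund",
}

# The single query term and the expanded set quoted on the slide.
single = ["retirement"]
concept = ["retirement", "Pension", "Savings", "Eligible",
           "pension_fund", "income_plan", "Pension_plans"]

def rank(terms):
    """Return names of matching documents, highest term count first."""
    scored = {name: score(text, terms) for name, text in docs.items()}
    matching = [name for name, s in scored.items() if s > 0]
    return sorted(matching, key=lambda name: -scored[name])

rank_single = rank(single)    # doc_c is missed entirely
rank_concept = rank(concept)  # doc_c is found and doc_b moves to the top
```

With the single term, the document that never mentions "retirement" is not returned at all; with the expanded set it is found, and the document that touches several related terms ranks first.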
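Concept of Neural Network Language Model
[Figure: the context words “We”, “used”, “modeling”, “climate”, “change” enter the INPUT layer and pass through the Hidden Layer; the OUTPUT Layer gives a probability P for each candidate term.]
Term        P
Computer    0.9
Algorithm   0.8
Software    0.8
Brain       0.7
iPhone      0.4
Ghost       0.3
Dog         0.2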
Problems of Word Representation in Traditional Language Models
• One-Hot Representations
  – Simple way to encode discrete concepts, such as words
  – Example: “we used computer modeling climate change”
      We       = [1 0 0 0 0 0]
      used     = [0 1 0 0 0 0]
      computer = [0 0 1 0 0 0]
      modeling = [0 0 0 1 0 0]
      climate  = [0 0 0 0 1 0]
      change   = [0 0 0 0 0 1]
  – A one-hot encoding makes no assumption about word similarity: all words are equally different from each other
  – This representation is very high-dimensional: the dimensionality is the size of the vocabulary, and a typical vocabulary size is 100,000
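A small Python sketch makes the "equally different" point concrete. It builds the six-word vocabulary from the example sentence and shows that any two distinct one-hot vectors have a dot product of 0 and the same squared distance. The helper names are hypothetical.

```python
# Vocabulary taken from the example sentence on the slide.
sentence = "we used computer modeling climate change".split()
vocab = {word: index for index, word in enumerate(sentence)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

computer = one_hot("computer")
modeling = one_hot("modeling")
climate = one_hot("climate")

# Any two distinct words look exactly as (un)related as any other pair.
same_similarity = dot(computer, modeling) == dot(computer, climate) == 0
same_distance = squared_distance(computer, modeling) == squared_distance(computer, climate) == 2
```

The vector length equals the vocabulary size, so with a 100,000-word vocabulary each word would need 100,000 components, all but one of them zero.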
Word2Vec in Natural Language Applications
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Collobert, R., et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011).
Socher, R., Lin, C. C-Y., Manning, C. & Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Proc. International Conference on Machine Learning 129–136 (2011).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014).
Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014).
Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).
Representing Words as Vectors and Adjusting the Vector Based on How Words Are Used in Writing and Talking
Word2Vec (Google, 2013)
Example sentence: “The dog (cat) is walking in the room”
Word      Initial vector        New vector
The       (0.12, 0.23, 0.22)    (0.12, 0.23, 0.22)
dog       (0.32, 0.27, 0.94)    (0.62, 0.99, 0.14)
is        (0.18, 0.88, 0.45)    (0.18, 0.88, 0.45)
walking   (0.23, 0.92, 0.23)    (0.23, 0.92, 0.23)
in        (0.77, 0.25, 0.11)    (0.77, 0.25, 0.11)
the       (0.12, 0.23, 0.22)    (0.12, 0.23, 0.22)
room      (0.41, 0.13, 0.29)    (0.41, 0.13, 0.29)
Two Approaches to Build Word2Vec Models
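The figure for this slide is not preserved in the text. Assuming it refers to the two architectures usually associated with Word2Vec, CBOW (predict a word from its surrounding context) and skip-gram (predict the surrounding context from a word), the difference shows up in how training examples are cut from a sentence. The sketch below only generates those examples; it does not train a model, and the function names and the window size of 1 are illustrative choices.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram style: one (center word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW style: one (list of context words, center word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the dog is walking in the room".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

For the word "dog", skip-gram yields the pairs ("dog", "the") and ("dog", "is"), while CBOW yields the single example (["the", "is"], "dog").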
Conceptual Vector Space distribution of terms by Word2Vec
[Figure: terms from the corpus plotted in Vector Space. Cluster labels: CERAMICS, NUCLEAR WASTE, CLOUD COMPUTING. Plotted terms: pzt, Ceramic, ferroelectrics, glasses, Lead_zirconate, management, reports, disposal, radioactive, wastes, hlw, acceptability, uranium, msword, hpc, supercomputing, petascale, networking, distributed.]
Related terms for “retirement”:
Related term     Vector distance
Pension          0.525072
Savings          0.495344
Eligible         0.469864
pension_fund     0.46272
income_plan      0.449314
pension_plans    0.435645
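The slide does not say which measure sits behind the "Vector distance" column. Assuming it is a cosine similarity, which is a common choice for word vectors, a ranked list of related terms can be produced as in the sketch below. The 3-dimensional vectors are invented toy values, not output of the Sandia model, and `most_similar` is a hypothetical helper written here, not a library call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(term, vectors, topn=3):
    """Rank all other terms by cosine similarity to the given term."""
    query = vectors[term]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for illustration only.
vectors = {
    "retirement": [0.9, 0.8, 0.1],
    "pension": [0.8, 0.9, 0.2],
    "savings": [0.7, 0.5, 0.3],
    "supercomputing": [0.1, 0.2, 0.9],
}

related = most_similar("retirement", vectors)
related_terms = [term for term, _ in related]
```

Terms whose vectors point in a similar direction to "retirement" come first, and an unrelated term such as "supercomputing" ends up at the bottom of the list.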
Example of Term Expansion
Enhancing queries with Word2Vec
[Figure: six numbered steps connecting the SearchPoint Web service, the Word2vec model and the Solr Index.]
Original query: “science”
Enhanced query: +((science)^1.5 (engineering sciences research)^1.2)
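The enhanced query string shown on this slide, `+((science)^1.5 (engineering sciences research)^1.2)`, can be assembled from a query term and its related terms with a small helper. This is a sketch: the function and parameter names are hypothetical, the boosts 1.5 and 1.2 are simply the values visible on the slide, and no escaping of query-syntax special characters is attempted.

```python
def build_enhanced_query(term, related, term_boost=1.5, related_boost=1.2):
    """Build a boosted query: original term weighted above its related terms."""
    original = f"({term})^{term_boost}"
    if not related:
        # Nothing to expand with: fall back to the boosted original term only.
        return f"+({original})"
    expansion = f"({' '.join(related)})^{related_boost}"
    return f"+({original} {expansion})"

# Related terms as they appear in the slide's example.
query = build_enhanced_query("science", ["engineering", "sciences", "research"])
```

Giving the original term the higher boost keeps documents that match the customer's own word ahead of documents that match only the expansion.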
Handling synonyms with Word2Vec
[Figure: query “dog” → Word2Vec → “… or leash or vet”; document terms shown: “leash”, “canine”.]
Handling homonyms with Word2Vec
[Figure: query “lead” → Word2Vec → “… or team or project”; document terms shown: “lead” with “project”, and “lead” with “contaminant”.]
Results without query expansion
Results with query expansion
Acronyms without query expansion
Acronyms with query expansion
Evaluating the results of term expansion
Integrating Word2Vec with the SearchPoint App
[Figure: architecture diagram. Components shown: Word2Vec model builder, Oracle, BLOBs, WebLogic App Server (Model Loader, Solr client, cache), Solr. Arrow labels: content, models, query, enhanced query, results.]
Performance considerations
• SearchPoint is a J2EE app deployed to WebLogic
• Model file sizes
  – solrTerms.ser: 479 MB
  – solrPhrases.ser: 825 MB
• Models loaded during Servlet init
  – 69 seconds to load terms
  – Java memory usage: -Xms256m -Xmx2g
• Time to access terms for query: < 150 ms
Future work
• Tackle the query disambiguation problem (polysemy)
  – Predict most likely meaning of query term based on corpus usage
  – Predict personalized meaning from customer’s prior usage
• Identify phrases within the query
  – Can eliminate many non-relevant results
  – Given an expertise query “server admin java”, “server admin” or java will provide better results than server or admin or java
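One simple way to picture the phrase-identification idea is a greedy longest-match against a set of known phrases. This is a minimal sketch under stated assumptions, not the planned implementation: the phrase set is hypothetical (in practice it might come from a phrase model), and `group_phrases` and `max_len` are names introduced here for illustration.

```python
def group_phrases(query, known_phrases, max_len=3):
    """Greedily join adjacent query words that form a known phrase."""
    tokens = query.split()
    parts = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in known_phrases:
                parts.append(f'"{candidate}"')
                i += n
                break
        else:
            # No phrase starts here: keep the single word.
            parts.append(tokens[i])
            i += 1
    return " or ".join(parts)

# Hypothetical phrase vocabulary.
known_phrases = {"server admin"}
with_phrases = group_phrases("server admin java", known_phrases)
without_phrases = group_phrases("server admin java", set())
```

With the phrase recognized, the query becomes `"server admin" or java`; without it, the same input falls back to `server or admin or java`, which also matches documents that mention only "server" or only "admin".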