26
Representing Documents via Latent Keyphrase Inference April. 15 th , 2016

Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

Representing Documents via LatentKeyphrase Inference

April. 15th, 2016

Page 2: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

2

Document Representation in Vector Space

Critical for document retrieval, categorization

Page 3: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

3

q Bag-of-Words or Phrases

q Cons: Sparse on short texts

Traditional Methods

Page 4: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

4

q Topic models [LDA]

q Cons: Difficult for human to infer topic semanticsEach topic is a distribution over words, each document is a mixture of corpus-wide topics

Page 5: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

5

q Concept-based models [ESA]

q Cons: Low coverage of concepts in human-curated knowledge base

Every Wikipedia article represents a concept

Article words are associated with the concept (TF.IDF), which help infer concepts from document

Concept:Panthera

Cat[0.92]

Leopard[0.84]

Roar[0.77]

Page 6: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

6

q Word/Document embedding models [word2vec paragraph2vec]

q Cons: Difficult to explain what each dimension means

Page 7: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

7

q Use domain keyphrases as the entries in the vector andq Identify document keyphrases (subset of domain keyphrases)

by evaluating relatedness between (doc, domain keyphrase)q Unsupervised model

Document Representation Using Keyphrases

Corpus Domain Keyphrases <K1, K2, …, KM>

Page 8: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

8

q Where to get domain keyphrases from a given corpus?q Mining Quality Phrases from Massive Text Corpora [SIGMOD15]

q How to identify document keyphrases?q Can be latent mentions (short text)q Relatedness scores

Challenges

Page 9: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

9

q Powered by Bayesian Inference on “Domain Keyphrase Silhouette”q Domain Keyphrase Silhouette: Topic centered on domain keyphrase

q “Reverse” topic modelsq Learned from corpus

How to identify document keyphrases?

Page 10: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

10

Framework for Latent Keyphrase Inference (LAKI)

Page 11: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

11

Page 12: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

12

q Learning Hierarchical Bayesian Network (DAG)

BinaryVariables

Domain Keyphrase Silhouette

Task 1: Model Learning: learning link weightsTask 2: Structure Learning: learning network structure

Page 13: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

13

Noise/Prior Aggregated over all other linksconnected with 𝑍"

q Use Z to represent K (domain keyphrases)and T (content units)

q Noisy-ORq A parent node is easier to activate its children

when the link weight is largerq A child node is influenced by all its parents

Task 1: Model Learning given Structure

Toy example

Page 14: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

14

q Training data: Documentsq Expectation-step:q For each document, collect sufficient statisticsq Link firing (Parent, child both being activated) probabilityq Node activation probability

q Maximization-step:q Update link weight

Fully observedcontent units

Partially observeddocument keyphrases

Maximum Likelihood Estimation

Page 15: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

15

q Domain keyphrases are connected to content unitsq Help infer document keyphrases from content units

q Domain keyphrases are interconnectedq Help infer document keyphrases from other keyphrases

Task 2: Structure Learning

Page 16: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

16

q Data-Driven, DAG, similar to ontologyq Heuristic:q Two nodes are connected onlyq Closely Related: word2vecq Co-occur frequently

q Links are always point to less frequent nodesq Work well in practice

A Heuristic Approach

Page 17: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

17

Page 18: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

18

q Exact inference is slow!q NP hard to compute posterior probability for Noisy-Or networksq Approximate inference insteadq Pruning irrelevant nodes using an efficient scoring functionq Gibbs sampling

Inference

Page 19: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

19

q Two text-related tasks to evaluate document representation qualityq Phrase relatednessq Document classification

q Two datasets

Experiments

Page 20: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

20

q ESA (Explicit Semantic Analysis)q KBLink uses link structure in Wikipediaq BoW (bag-of-words)q ESA-C extends ESA by replacing Wiki with domain corpusq LSA (Latent Semantic Analysis)q LDA (Latent Dirichlet Allocation)q Word2Vec is a neural network computing word

embeddingsq EKM uses explicit keyphrase detection

Methods

Page 21: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

21

Phrase Relatedness Correlation

Document Classification

Page 22: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

22

Case Study

Page 23: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

23

Page 24: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

24

Time Complexity

#Samples1000 3000 5000 7000 9000

Runin

g Ti

me

(ms)

100

200

300

400

500

AcademiaYelp

#Quality Phrases After Pruning10 100 200 300 400 500

Runin

g Ti

me

(ms)

100

200

300

400

500

AcademiaYelp

#Words0 100 200 400 800

Runin

g Ti

me

(ms)

0

500

1000

1500

AcademiaYelp

Page 25: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

25

Breakdown of Processing Time

Page 26: Representing Documents via Latent Keyphrase Inference · 26 q We have introduced a novel document representation method using latent keyphrases q Each dimension is explainable q Works

26

q We have introduced a novel document representation method using latentkeyphrases

q Each dimension is explainableq Works for short textq Works for closed-domain text

q We have developed an efficient inference method to do real time keyphraseidentification

q Future workq Better structure learning approachq Combined with knowledge baseq Try other inference method other than Gibbs sampling

q Code available at http://jialu.info

26

Conclusion