Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning. Authors: Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, Mark Steyvers. Source: ISWC 2008. Reporter: Yong-Xiang Chen




Page 1: Title

Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning

Authors: Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, Mark Steyvers

Source: ISWC 2008
Reporter: Yong-Xiang Chen

Page 2: Abstract

• Problem: mapping an entire document or Web page to concepts in a given ontology
• Proposes a probabilistic modeling framework
  – Based on statistical topic models (also known as LDA models)
  – The methodology combines human-defined concepts with data-driven topics
• The methodology can be used to automatically tag Web pages with concepts
  – From a known set of concepts
  – Without any need for labeled documents
  – Maps all words in a document, not just entities

Page 3: Concepts from ontologies

• An ontology includes
  – Ontological concepts and their associated vocabulary
  – The hierarchical relations between concepts
• Topics from statistical models and concepts from ontologies both represent “focused” sets of words that relate to some abstract notion

Page 4: Words that relate to “FAMILY”

Page 5: Statistical Topic Models

• LDA is a state-of-the-art unsupervised learning technique for extracting thematic information
• Topic model
  – Words in a document arise via a two-stage process:
    • Word-topic distributions: p(w|z)
    • Topic-document distributions: p(z|d)
  – Both are learned in a completely unsupervised manner
• Gibbs sampling assigns topics to each word in the corpus
  – The topic variable z plays the role of a low-dimensional representation of the semantic content of a document (see the sketch below)
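A minimal sketch of this two-stage generative process in Python (my illustration; the toy vocabulary, word-topic matrix phi, and document proportions theta are invented values, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two topics, each a multinomial distribution p(w|z).
vocab = ["gene", "dna", "cell", "ball", "game", "team"]
phi = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # topic 0: "biology"
    [0.02, 0.02, 0.01, 0.30, 0.35, 0.30],  # topic 1: "sports"
])

# Topic proportions p(z|d) for one document.
theta = np.array([0.8, 0.2])

def generate_document(n_words):
    """Stage 1: draw a topic z from p(z|d); stage 2: draw a word from p(w|z)."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)
        w = rng.choice(len(vocab), p=phi[z])
        words.append(vocab[w])
    return words

print(generate_document(10))
```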

Page 6: Topic

• A topic takes the form of a multinomial distribution over a vocabulary of words
• A topic can, in a loose sense, be viewed as a probabilistic representation of a semantic concept

Page 7: How do statistical topic modeling techniques “overlay” probabilities on concepts within an ontology?

• Two sources of information:
  1. C human-defined concepts
     – Each concept cj consists of a finite set of Nj unique words
  2. A corpus of documents, such as Web pages
• How can these two sources of information be merged within a topic model?
  – By “tagging” documents with concepts from the ontology, with little or no supervised labeled data available
• The concept model: replace the topic z with a concept c
• Unknown parameters: p(wi|cj) and p(cj|d)
• Goal: estimate these from an appropriate corpus (see the sketch below)
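A minimal sketch of how the human-defined concepts might be represented (my illustration; the concept names and word sets are hypothetical, not drawn from CIDE or ODP):

```python
# Each concept cj is a finite set of Nj unique words fixed a priori.
concepts = {
    "FAMILY": {"mother", "father", "child", "sister", "brother"},
    "SPORTS": {"ball", "game", "team", "score"},
}

# Invert the mapping: for each word, the concepts it may be assigned to.
word_to_concepts = {}
for concept, words in concepts.items():
    for w in words:
        word_to_concepts.setdefault(w, set()).add(concept)

print(word_to_concepts["mother"])  # {'FAMILY'}
```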

Page 8: Concept model vs. topic model

• In the concept model
  – The words that belong to a concept are defined by a human a priori
  – Each concept is limited to a small subset of the overall vocabulary
• In the topic model
  – All words in the vocabulary can be associated with any particular topic, but with different probabilities

Page 9: Treating concepts as “topics with constraints”

• The constraints consist of setting words that are not a priori mentioned in a concept to have probability 0
• Use Gibbs sampling to assign concepts to words in documents
  – The additional constraint is that a word can only be assigned to a concept that it is associated with in the ontology (see the sketch below)
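A minimal sketch of one constrained Gibbs sweep over a single document (a simplified collapsed-Gibbs-style update under assumed data structures; the count matrices, the `allowed` map, and the hyperparameters alpha and beta are my placeholders, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(doc_words, assignments, allowed, n_wc, n_cd, alpha=0.1, beta=0.01):
    """One sweep for document d: resample each token's concept assignment,
    restricted to the concepts the word belongs to in the ontology."""
    V = n_wc.shape[0]
    for i, w in enumerate(doc_words):
        # Remove the token's current assignment from the counts.
        c_old = assignments[i]
        n_wc[w, c_old] -= 1
        n_cd[c_old] -= 1
        # Score only the allowed concepts; all others implicitly have probability 0.
        cands = allowed[w]
        scores = np.array([
            (n_wc[w, c] + beta) / (n_wc[:, c].sum() + V * beta) * (n_cd[c] + alpha)
            for c in cands
        ])
        c_new = cands[rng.choice(len(cands), p=scores / scores.sum())]
        # Record the new assignment.
        n_wc[w, c_new] += 1
        n_cd[c_new] += 1
        assignments[i] = c_new
    return assignments
```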

Page 10: Estimating the unknown parameters

• To estimate p(wi|cj)
  – Count how many words in the corpus were assigned by the sampling algorithm to concept cj, and normalize these counts to arrive at the probability distribution p(wi|cj)
• To estimate p(cj|d)
  – Count how many times each concept is assigned to a word in document d, and again normalize and smooth the counts to obtain p(cj|d) (see the sketch below)
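Continuing the sketch, the counts accumulated during sampling can be normalized into the two distributions (additive smoothing is my assumed choice here; the paper's exact smoothing scheme may differ):

```python
import numpy as np

def estimate_parameters(n_wc, n_cd, mask, alpha=0.1, beta=0.01):
    """Turn sampling counts into p(w|c) and p(c|d).

    mask is a (V, C) 0/1 array marking which words belong to which concept,
    so smoothing applies only inside each concept's word set and
    out-of-concept words keep probability 0."""
    C = n_wc.shape[1]
    smoothed = (n_wc + beta) * mask
    p_w_given_c = smoothed / smoothed.sum(axis=0, keepdims=True)
    p_c_given_d = (n_cd + alpha) / (n_cd.sum() + C * alpha)
    return p_w_given_c, p_c_given_d
```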

Page 11: Example

• Learned probabilities for words (ranked by probability) for two different concepts from the CIDE ontology, after training on the TASA corpus

Page 12: Uses of the concept model

• Use the learned probabilistic representation of concepts to map new documents onto concepts within an ontology
• Use the semantic concepts to improve the quality of data-driven topic models

Page 13: Variations of the concept model framework

• ConceptL (concept-learned)
  – The base model described above, in which the word-concept probabilities p(wi|cj) are learned from the corpus
• ConceptU (concept-uniform)
  – The word-concept probabilities p(wi|cj) are defined to be uniform for all words within a concept
  – Baseline model
• ConceptF (concept-fixed)
  – The word-concept probabilities are available a priori as part of the concept definition
• For both of these models, Gibbs sampling is used as before, but the p(w|c) probabilities are held fixed and not learned
• ConceptLH, ConceptFH
  – Incorporate hierarchical information such as a concept tree
  – An internal concept node is associated with its own words and all the words associated with its children
• ConceptL+Topics, ConceptLH+Topics
  – Incorporate unconstrained data-driven topics alongside the concepts
  – Allow the Gibbs sampling procedure to either assign a word to a constrained concept or to one of the unconstrained topics (see the sketch below)
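A minimal sketch of the Concept+Topics combination (my simplification): during sampling, a token's candidate set is the union of its ontology-allowed concepts and all unconstrained topics, which accept any word:

```python
def candidate_assignments(word, allowed, n_concepts, n_topics):
    """Candidates for one token in a Concept+Topics model.

    Concepts occupy ids [0, n_concepts); unconstrained topics occupy
    ids [n_concepts, n_concepts + n_topics) and may take any word."""
    concept_ids = sorted(allowed.get(word, []))
    topic_ids = list(range(n_concepts, n_concepts + n_topics))
    return concept_ids + topic_ids
```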

Page 14: Experiment setting (1/2)

• Text corpus
  – Touchstone Applied Science Associates (TASA) dataset
  – D = 37,651 documents with passages excerpted from educational texts
  – Divided into 9 different educational topics
  – Experiments focus only on the documents classified as SCIENCE (D = 5,356; 1.7M word tokens) and SOCIAL STUDIES (D = 10,501; 3.4M word tokens)

Page 15: Experiment setting (2/2)

• Concepts
  – The Open Directory Project (ODP), a human-edited hierarchical directory of the web
    • Contains descriptions and URLs for a large number of hierarchically organized topics
    • All topics in the SCIENCE subtree were extracted, consisting of C = 10,817 nodes
  – The Cambridge International Dictionary of English (CIDE)
    • Consists of C = 1,923 hierarchically organized semantic categories
    • The concepts vary in the number of words, with a median of 54 words and a maximum of 3,074
    • Each word can be a member of multiple concepts

Page 16: Tagging Documents with Concepts

• Assign likely concepts to each word in a document, depending on the context of the document
• The concept models assign concepts at the word level, so a document can be summarized at multiple levels of granularity
  – Snippets, sections, whole Web pages… (see the sketch below)
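A minimal sketch (building on the assumed structures above) of summarizing the word-level concept assignments over any span of tokens, from a snippet up to a whole page:

```python
from collections import Counter

def top_concepts(assignments, start, end, k=3):
    """Summarize a token span (snippet, section, or whole page) by the
    k concepts most frequently assigned to its tokens."""
    return Counter(assignments[start:end]).most_common(k)

# Whole-page summary: top_concepts(assignments, 0, len(assignments))
```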

Page 17: Using the ConceptU model to automatically tag a Web page with CIDE concepts

Page 18: Using the ConceptU model to automatically tag a Web page with ODP concepts

Page 19: Tagging at the word level using the ConceptL model

Page 20: Language Modeling Experiments

• Perplexity
  – Lower scores are better, since they indicate that the model’s distribution is closer to that of the actual text
  – 90% of the documents are used for training
  – 10% are used for computing test perplexity (see the sketch below)
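For reference, test perplexity is conventionally exp(-(1/N) · Σ log p(w)) over the N test tokens; the slide does not restate the formula, so this is the standard definition rather than a detail from the paper. A minimal sketch:

```python
import numpy as np

def perplexity(log_probs):
    """Perplexity from per-token log-likelihoods log p(w); lower is better,
    meaning the model's distribution is closer to the actual text."""
    return float(np.exp(-np.mean(log_probs)))
```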

Page 21: Perplexity for various models

Page 22: Topic model vs. ConceptLH+Topics

Page 23: Varying the training data

Page 24: Conclusion

• The model can automatically place words and documents in a text corpus into a set of human-defined concepts
• Illustrated how concepts can be “tuned” to a corpus, yielding probabilistic representations of concepts that lead to improved language models