17
ailab.ijs.si Approximate subgraph matching for detection of topic variations Mitja Trampuš Dunja Mladenić AI Lab, Jožef Stefan Institute, Slovenia

Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

Embed Size (px)

Citation preview

Page 1: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

ailab.ijs.si

Approximate subgraph matching for detection of topic variations

Mitja Trampuš

Dunja MladenićAI Lab, Jožef Stefan Institute, Slovenia

Page 2: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Mining Diversity• Web content varies in many aspects, e.g.

o Topical

o Social (author, target audience, people written about)

o Geographical (publisher, places written about)

o Sentiment (positive/negative)

o Writing style (structure, vocabulary)

o Coverage bias

• This work: (micro-)topical diversityo Macroscopic = largely solved

o Microscopic = challenge

Page 3: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Task:Given a collection of texts on a topic,

• identify a common template

• align texts to the template

Page 4: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Template representation• Syntactic

o info1: X people were killed / killed X people / resulted in X

casualties

o info2: blew up Y / destroyed Y / attacked in a Y

• Semantico kill(bomber, people); count(people, X)

o destroy(bomber, Y)

people bomber Ykill destroyX count

Page 5: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

patients terrorist hospitalkill demolish100 count

treatment attack

receive withstand

execute

policeofficer

bomberpolicestation

slaughter blow up2 count

Pre

req

uis

ite:

S

eman

tic

Gra

ph

believerssuicidebomber

churchkill destroy12 count

vestexit

cardrive

wearrun

Page 6: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Mining Templates• Template := subgraph with frequent

specializationso Specializations implied by background taxonomy

(WordNet)

o Threshold frequency manually defined

believerssuicidebomber

churchkillteardown

12 countpoliceofficer

bomberpolicestation

slaughter blow up2 count

people bomber buildingkill destroyX count

Page 7: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Semantic Graph Construction

1. Data: Google News crawl

2. HTML cleanup

3. Named entity tagging

4. Pronoun resolution (he/she/him/her)

5. Named entity consolidation (Barack Obama vs President Obama)

6. Parsing, triple/fact/assertion extraction

(for now: subj-verb-obj only)

7. Ontology/taxonomy alignment

8. Merging triples into a graph

Page 8: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Approximate subgraph matching

believerssuicidebomber

churchkillteardown

12 count

policeofficer

bomberpolicestation

slaughter blow up2 count

people person locationkill destroynumber

count

people person locationkilldestroynumber count

people person locationkill destroynumber count

GENERALIZE

FREQUENT SUBTREE MINING

Page 9: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Approximate subgraph matching

people bomber buildingkill destroynumber count

people person locationkill destroynumber count

SPECIALIZE

believerssuicidebomber

churchkillteardown

12 countpoliceofficer

bomberpolicestation

slaughter blow up2 count

Page 10: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Preliminary results

• 5 test domains; for each:o ~10 graphs, ~10000 nodes

o 10-60 seconds

• At min. support 30%o 20 maximal patterns, 9 manually judged as interesting

Page 11: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Page 12: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Conclusion• Future work:

o Mapping text -> semantics

• Other ontologies?

o Interestingness measure for assertions and patterns

o Evaluation (precision, recall; multiple domains)

o Alternative approaches to generalizing subgraphs

• Template extraction is achievable, but not easy

• Human filtering of results hard to avoid

• Current approach reasonably fast

Page 13: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.siThank you.

Page 14: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

Can we extract all relations?Kind of …

Thousands of small quakes resumed 18 months ago and continue to rattle Mammoth Lakes, June Lake and other Mono County resort towns. The temblors, most measuring 1 to 3 on the Richter scale, started beneath Mammoth Mountain.

Subject – Verb – Object

Triplets

Semantic Graph

Page 15: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Page 16: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Templates - why

• Interpret contento news archives: structure/annotate old texts, enable

semantic search

o wikipedia: suggestions for infobox entries

• Generate contento wikipedia: a starting point for new articles / a checklist of

information to be included

• No normative definition of “good template”

Page 17: Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus

AI Lab, Jozef Stefan Institute

ailab.ijs.si

Evaluation

• Qualitativeo Usage-specific

o Not useful for tuning algorithms

• Quantitativeo Precision

o Recall