Concept and Theme Discovery through Probabilistic Models and Clustering
Qiaozhu Mei
Oct. 12, 2005

Concepts and Themes

Language units in biology literature mining:
- Terms
- Phrases
- Entities
- Concepts (tight groups of terms/entities representing semantics, e.g. gene synonyms)
- Themes (loose groups of terms representing topics/subtopics)

Theme Discovery

What we’ve got now:
- A generative model to extract k themes from a collection.
- Each theme is a language model, represented by the top-probability words in that model.
- KL divergence models the distance/similarity between themes; it is used to retrieve the themes most similar to a term group.

$$P(w : d) = b \, P(w \mid B) + (1 - b) \sum_{i=1}^{k} \pi_{d,i} \, P(w \mid \theta_i)$$
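A mixture model of this kind is usually fitted with EM. The following is a minimal sketch, assuming a unigram background model B with a fixed weight b, k theme language models, and per-document theme weights. The function names `fit_themes` and `kl`, the default b = 0.9, the random initialisation, and the iteration count are illustrative assumptions, not details from the slides.

```python
import math
import random
from collections import Counter


def fit_themes(docs, k, b=0.9, iters=50, seed=0):
    """EM for a background-plus-k-themes unigram mixture (illustrative only)."""
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    # Background model P(w|B): relative frequency over the whole collection.
    p_bg = {w: sum(c[w] for c in counts) / total for w in vocab}
    # Random initial theme models P(w|theta_i); uniform document weights.
    theta = []
    for _ in range(k):
        raw = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(raw.values())
        theta.append({w: v / z for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        new_theta = [dict.fromkeys(vocab, 1e-12) for _ in range(k)]
        new_pi = [[1e-12] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                theme_part = [pi[d][i] * theta[i][w] for i in range(k)]
                mix = sum(theme_part)
                # E-step: probability that w was not drawn from the background.
                not_bg = (1 - b) * mix / (b * p_bg[w] + (1 - b) * mix)
                for i in range(k):
                    r = n * not_bg * theme_part[i] / mix
                    new_theta[i][w] += r
                    new_pi[d][i] += r
        # M-step: renormalise the expected counts.
        theta = []
        for t in new_theta:
            z = sum(t.values())
            theta.append({w: v / z for w, v in t.items()})
        pi = [[v / sum(row) for v in row] for row in new_pi]
    return theta, pi


def kl(p, q, eps=1e-12):
    """KL divergence D(p || q); zero entries of q are floored at eps."""
    return sum(pw * math.log(pw / (q.get(w, 0.0) or eps))
               for w, pw in p.items() if pw > 0)
```

Sorting each returned theme by probability gives its top words, and `kl(theta[i], theta[j])` gives a (non-symmetric) distance between two themes.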

Theme Discovery (cont.)

What we’ve got now (cont.):
- Use an HMM to segment the whole collection with the extracted themes.
- Use MMR (maximal marginal relevance) to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned).

Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
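The MMR phrase selection can be sketched as a greedy loop. This is a minimal illustration, assuming each candidate phrase already has a relevance score (for instance an n-gram probability) and that edit distance is normalised by the longer phrase length to obtain a similarity in [0, 1]. The trade-off weight `lam`, the normalisation, and the function names are assumptions, not the tuned setup mentioned on the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled so that identical strings score 1.0."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


def mmr_select(scores, n, lam=0.7):
    """Greedily pick n phrases, trading relevance against redundancy."""
    candidates = dict(scores)
    selected = []
    while candidates and len(selected) < n:
        def mmr(p):
            redundancy = max((similarity(p, s) for s in selected), default=0.0)
            return lam * candidates[p] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        del candidates[best]
    return selected
```

With a small `lam`, near-duplicates such as singular/plural variants of the same phrase are pushed down the ranking; with `lam` close to 1 the selection falls back to plain relevance ordering.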

Some justifications

Fly collection:
- Cluster 0: circadian
- Cluster 1: adh, evolution
- Cluster 2: a mixture of two topics, apoptosis and promoters
- Cluster 6: brain development
- Cluster 8: cell division
- Cluster 12: drosophila immunity
- Cluster 13: nervous systems
- Cluster 14: hedgehog, segment polarity gene
- Cluster 16: histone, Polycomb
- Cluster 17: visual system

Theme Discovery (cont.)

Problems:
- How to select k? (How many themes do we believe there are in the collection? The bee collection should have a smaller k than the fly collection.)
- Can we find themes in a hierarchical manner? This could solve the former problem; however, where to cut off?
- How to represent a theme? From top words alone it is sometimes difficult to tell the semantics. Phrases? Sentences?
- Other possible approaches to extract themes? (LDA, clustering methods)

Hierarchical Theme Discovery

A straightforward approach (top-down splitting):
- Discover k themes from the initial collection.
- Segment the collection by the k themes.
- For each theme, build a sub-collection from the segments of the previous step.
- For each sub-collection, extract k’ themes.
- Repeat these steps iteratively.

Problem: when to stop splitting?

[Figure: Collection splits into Theme1, Theme2, Theme3; Theme2 splits into Theme2.1, Theme2.2, Theme2.3; and so on.]
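The top-down procedure is a plain recursion once "extract themes and segment" is treated as a black box. The sketch below assumes a caller-supplied `split(docs, k)` function that returns k sub-collections. Because the stopping rule is an open question on the slide, the fixed per-level k values in `ks` and the `min_docs` cutoff are placeholder assumptions, and `toy_split` is only a deterministic stand-in so that the example runs.

```python
def build_hierarchy(docs, split, ks, min_docs=2, depth=0, name="Collection"):
    """Recursively split a collection; ks[d] is the number of themes at depth d."""
    node = {"name": name, "size": len(docs), "children": []}
    # Placeholder stopping rule: no more levels requested, or too few documents.
    if depth >= len(ks) or len(docs) < min_docs:
        return node
    for i, sub in enumerate(split(docs, ks[depth]), 1):
        if sub:
            child = f"Theme{i}" if depth == 0 else f"{name}.{i}"
            node["children"].append(
                build_hierarchy(sub, split, ks, min_docs, depth + 1, child))
    return node


def toy_split(docs, k):
    """Stand-in for theme extraction plus segmentation (NOT a real theme model):
    buckets each document by the length of its first token, modulo k."""
    buckets = [[] for _ in range(k)]
    for d in docs:
        buckets[len(d[0]) % k].append(d)
    return buckets
```

Calling `build_hierarchy(docs, toy_split, ks=(5, 3))` would mirror the two-level layout (5 themes, then 3 sub-themes each), with a real theme extractor substituted for `toy_split`.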

Hierarchical Theme Discovery (results)

A bee collection with 929 documents

Level1: 5 themes

Level2: 3 sub-themes for each higher level theme

… … …

Hierarchical Theme Discovery (results)

Level 1 themes (top words):
- african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
- queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
- pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
- learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
- varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality

Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level 1 theme "african jelly royal european venom ...":
- african european population populations patterns pattern genetic discrimination mitochondrial studies information are contrast green two bees have derived africa subspecies
- larvae microorganisms gram bacteria 0 colonies royal queen jelly eubacteria non workers queens production 2 nest italian 5 fraction nestmates
- venom reward patients naja kda proteins wasp protein diptera pla2 vespula primates hominidae chordata vertebrata mug sting sperm dose quality

Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level 1 theme "queen workers worker signal jh ...":
- queen worker workers colonies pollen vibration eggs foraging development brood signal queens bees anarchistic behavioral iridaceae larvae egg pheromone may
- food foragers dance transfer enzyme biosynthesis receivers contrast nectar flight source flow water information rates ddt rj caucasian visual green
- mammals vertebrates venom nonhuman l ml models model chordates beeswax mug omega embryo mammalia vertebrata has chordata nurse coloured vg

Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level 1 theme "pollinator plants pollination flowers ...":
- seed per crop sunflower number cruciferae fruit hybrid agriculture seeds quality cultivar weight helianthus oilseed compositae annuus yield pollination set
- ecology is species environmental sciences flowering floral terrestrial pollinator visiting reproduction plants c cashew self animalia food insects faba size
- pollen eep honeybees mating bumblebees sp hive bacteria scent mimosa brazil undertakers chromatography marks recently gram eubacteria caraway microorganisms propolis

Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level 1 theme "learning brain conditioning olfactory ...":
- dopamine levels development age binding pupal brain octopamine division adult colonies labor glass treated colony ryr pigmentation chromosomes arolium da
- bees sucrose conditioning response learning extension proboscis pollen foragers performance between thresholds honeybees solution discrimination strain rate foraging concentration low
- imidacloprid current memory mushroom neurons 1 expressed 4 cells antennal mb bodies currents nervous brain mv kinase receptors term protein

Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level 1 theme "varroa mite mites jacobsoni ...":
- mite varroa mites brood jacobsoni acarina colonies parasite for worker control a drone formic population acid host 0 cells treatment
- pollen bees foragers their or ta heat at hygienic foraging protein activity behaviour increased response blood flight strips metabolic removal
- viruses larvae microorganisms virus bacteria animal paenibacillus infection molecular pathogen eubacteria gram forming endospore positive sp apv entomopathogen

Phrase Representations:

biochemistry and molecular biophysics endocrine system chemical coordination and homeostasis molecular genetics biochemistry and molecular biophysics sense organs sensory reception animals arthropods chordates insects invertebrates mammals system chemical coordination and homeostasis vertebrata chordata animalia honey bee behavior terrestrial ecology mammalia vertebrata chordata animalia juvenile hormone queen rodentia mammalia vertebrata chordata animalia worker laid eggs vibration signal genetics biochemistry and molecular biophysics dufour s gland mammals nonhuman mammals workers egg laying queen mandibular gland pheromone nonhuman vertebrates iridaceae ixia arthropoda invertebrata animalia muridae aves vertebrata chordata animalia mug ml

Hierarchical Theme Discovery (cont.)

A bottom-up agglomerative approach:
- Find many micro-themes.
- Group similar micro-themes into larger ones.
- Borrow a strategy from data mining: BIRCH incrementally forms many micro-clusters, organized in a tree structure, followed by macro-clustering based on the micro-clusters.

Problem: again, when to stop?
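The grouping step of the bottom-up idea can be illustrated with a plain agglomerative loop over micro-theme language models. This is not BIRCH itself (there is no CF-tree here); it only sketches how similar micro-themes might be merged. Jensen-Shannon divergence as the distance, size-weighted averaging when merging, and the `max_dist` threshold as the stopping rule are all assumptions made for the example.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (dicts)."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl_to_m(a):
        return sum(a[w] * math.log(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)


def agglomerate(themes, max_dist):
    """Merge the closest pair of themes until the closest pair exceeds max_dist."""
    clusters = [([i], dict(t)) for i, t in enumerate(themes)]
    while len(clusters) > 1:
        d, a, b = min((js_divergence(clusters[i][1], clusters[j][1]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_dist:
            break  # hypothetical stopping rule
        ids_a, pa = clusters[a]
        ids_b, pb = clusters[b]
        na, nb = len(ids_a), len(ids_b)
        # Merged model: average of the two models, weighted by cluster size.
        merged = {w: (na * pa.get(w, 0.0) + nb * pb.get(w, 0.0)) / (na + nb)
                  for w in set(pa) | set(pb)}
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append((ids_a + ids_b, merged))
    return sorted(sorted(ids) for ids, _ in clusters)
```

The threshold only moves the "when to stop" question into a parameter; it does not answer it.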

Hierarchical Theme Discovery (cont.)

Model-based approach:
- Hofmann, IJCAI 99.
- Assume we know the collection is generated from a hierarchical structure, and use a generative model to learn the themes (e.g. make use of GO hierarchies).

Problem: in most cases we don’t know the hierarchies.

Other Research Problems

Represent a theme:
- Using top words: where to cut?
- Using phrases: have to tune the MMR (many possible strategies and parameters to tune)
- Using sentences? Like summarization

Themes are interesting… but how to make use of the themes?

How to evaluate themes??

Concept Extraction

What we have now:
- N-gram algorithm (actually 2-gram): iteratively group the pair of terms that are most likely to be replaceable, considering the context of one term before/after each.
- Time complexity: O(N^3); space complexity: currently O(N^2). The Beespace server can currently handle <= 9000 terms (2.4 GB memory). (Performance has not been evaluated, because only small data sizes can be handled.)
- Problem: based on mutual information, which prefers 2-grams with low frequency; doesn’t make use of farther context.
- Will removing stop words help or hurt performance?
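One way to read the 2-gram merging step is as a greedy, Brown-clustering-style procedure: repeatedly merge the two term classes whose merge preserves the most average mutual information between adjacent classes. The sketch below is a naive version for toy input only; it recomputes the mutual information for every candidate pair, which is slower than the O(N^3) quoted above. The function names and the exact merge criterion are assumptions about the method, not the Beespace implementation.

```python
import math
from collections import Counter
from itertools import combinations


def avg_mutual_information(bigrams):
    """Average mutual information between left and right items of bigram counts."""
    total = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n
    return sum(n / total * math.log(n * total / (left[a] * right[b]))
               for (a, b), n in bigrams.items())


def merged(bigrams, x, y):
    """Bigram counts after relabelling class y as class x."""
    out = Counter()
    for (a, b), n in bigrams.items():
        out[(x if a == y else a, x if b == y else b)] += n
    return out


def greedy_merge(tokens, n_merges):
    """Perform n_merges greedy merges; return the groups with more than one term."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    members = {t: [t] for t in set(tokens)}
    for _ in range(n_merges):
        if len(members) < 2:
            break
        # Pick the pair whose merge keeps the mutual information highest.
        x, y = max(combinations(sorted(members), 2),
                   key=lambda p: avg_mutual_information(merged(bigrams, *p)))
        bigrams = merged(bigrams, x, y)
        members[x] += members.pop(y)
    return sorted(sorted(m) for m in members.values() if len(m) > 1)
```

Two terms that always appear in identical one-word contexts cost no mutual information when merged, so they are grouped first; this is also why sparse, low-frequency 2-grams can dominate the early merges.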

Some findings:

A small dataset (200+ abstracts containing gene synonyms):
- Only 600 iterations (600 merges).
- Most of the merges are reasonable, but not really useful.
  - E.g. head-to-head, tail-to-tail
  - E.g. within-locus, between-locus
- FBgn0000017: Dsrc, Dabl
- FBgn0000078: amylase-null, AMY-null

Problem: the document set is too small and the n-grams too sparse to find useful concepts.

Concept Extraction (cont.)

Other possible strategy:
- Lin et al., KDD 02: use a feature vector to represent each term, where the weights are the mutual information between the term and each context feature. This is more flexible than n-grams (if only 2-grams are considered as context features, this will be similar to what we have).
- Use a committee to represent a cluster, which ensures the clusters are tight and robust.
- Problem: not sure how to select features.
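The feature-vector representation can be sketched as follows, assuming the simplest feature set mentioned above: the immediately preceding and following word. Weights are pointwise mutual information (only positive values are kept), and terms are compared by cosine similarity. The feature choice, the positive-PMI cutoff, and the function names are assumptions for illustration; the committee step of the cited approach is not shown.

```python
import math
from collections import Counter, defaultdict


def pmi_vectors(tokens):
    """Map each term to {context feature: PMI weight}, using prev/next words."""
    pair, term, feat = Counter(), Counter(), Counter()
    for i, t in enumerate(tokens):
        context = []
        if i > 0:
            context.append(("prev", tokens[i - 1]))
        if i + 1 < len(tokens):
            context.append(("next", tokens[i + 1]))
        for f in context:
            pair[(t, f)] += 1
            term[t] += 1
            feat[f] += 1
    total = sum(pair.values())
    vectors = defaultdict(dict)
    for (t, f), n in pair.items():
        pmi = math.log(n * total / (term[t] * feat[f]))
        if pmi > 0:  # keep only positively associated features
            vectors[t][f] = pmi
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Richer features (wider windows, syntactic relations) would plug into the same `pair` counting loop, which is where the feature-selection question arises.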

Summary

Theme extraction:
- Generally performs well, if we can find a good k.
- Hierarchical clustering can solve this problem, but we still need to find a reasonable stopping criterion.
- Representation is an interesting problem: MMR phrase extraction should be further tuned.
- Difficult to evaluate other than by expert judgment.

Concept extraction:
- N-gram has space constraints; we haven’t really tested the performance.
- Generally, the performance should be better on large data sets.
- Other clustering algorithms can be explored.