Term Co-occurrence Analysis as an Interface to Digital Libraries

Jan W. Buzydlowski

Howard D. White

Xia Lin

College of Information Science and Technology

Drexel University, Philadelphia, Pennsylvania, USA

Digital Library Research

First Wave
  – How to store it

Next Wave
  – How to retrieve it (IR)
    • Text Mining
    • Visual Information Retrieval Interface (VIRI)

Term Co-occurrence Analysis (TCA)
  – Co-occurrence vs. lexical associations
  – Maps vs. lists

Term Definition

Unit of Analysis
  – Words
  – Documents
  – Authors
  – Journals

Section of Focus
  – Abstract/Text
  – Title
  – Bibliography
  – Keywords

Example

Words in Title
  – Term
  – Co-occurrence
  – Analysis
  – Interface
  – Digital
  – Library

Authors in Bibliography
  – Salton-G
  – Chen-C
  – White-HD
  – Ding-Y
  – Cleveland-W
  – McCain-K
  – Lin-X
  – Schvaneveldt-R
  – Kamada-T
  – Fruchterman-T

Term Co-occurrence Methodology

User determines which terms are of interest
  – Via a seed term
  – From a pre-defined list

The system returns the pair-wise co-occurrence counts of the terms over the collection of records

Example

Unit: Author; Section: Bibliography
User-supplied list: Plato, Aristotle, Smith, Brown

For a given data set (N = 4 unique terms)
  – Article 1: Plato, Aristotle, Smith, …
  – Article 2: Plato, Smith, …
  – Article 3: Plato, Aristotle, Smith, Brown, …

The following co-citations (C(4,2) = 6) are found

  COMBINATION            COUNT   ARTICLES
  Plato and Smith        3       1, 2, 3
  Plato and Aristotle    2       1, 3
  Plato and Brown        1       3
  Aristotle and Smith    2       1, 3
  Aristotle and Brown    1       3
  Smith and Brown        1       3
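
The pair-wise counting in this example can be sketched in a few lines of Java. This is only an illustration under simple assumptions: each article is modeled as a set of cited authors held in memory, and the class and method names (`PairCounter`, `coOccurrences`) are hypothetical, not part of the system described in these slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not the system's actual code.
final class PairCounter {

    // For every unordered pair of terms, collect the 1-based numbers of the
    // records that contain both terms. The list size is the co-occurrence count.
    static Map<String, List<Integer>> coOccurrences(List<String> terms,
                                                    List<Set<String>> records) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                List<Integer> hits = new ArrayList<>();
                for (int r = 0; r < records.size(); r++) {
                    Set<String> record = records.get(r);
                    if (record.contains(terms.get(i)) && record.contains(terms.get(j))) {
                        hits.add(r + 1);
                    }
                }
                result.put(terms.get(i) + " and " + terms.get(j), hits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("Plato", "Aristotle", "Smith", "Brown");
        List<Set<String>> articles = List.of(
                Set.of("Plato", "Aristotle", "Smith"),
                Set.of("Plato", "Smith"),
                Set.of("Plato", "Aristotle", "Smith", "Brown"));
        // Prints one line per pair: the pair, its count, and the article numbers.
        coOccurrences(terms, articles).forEach((pair, hits) ->
                System.out.println(pair + ": " + hits.size() + " " + hits));
    }
}
```

With N terms the double loop visits C(N,2) pairs, so the four-term list yields six entries. A production system would presumably read counts from an index rather than scan every record.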

Term Co-occurrence Significance

The frequent co-occurrence of term pairs within a set of documents indicates a strong association between those terms, whereas an infrequent co-occurrence indicates a weak one

– The association you would expect is borne out by the frequency

– The frequency you compute suggests a level of association

Pain and Management        Pain and Obtainment

Plato and Aristotle        Plato and Cher

Science and Nature         Science and National Tattler

A and B                    C and D

Term Co-occurrence Uses

Allows a user to get a “foothold” with just one term
  – One seed term returns many other related terms

Allows a user to get an “overview” with user-supplied/system-supplied terms
  – Co-occurrence counts with visualization

Seeding

User types in
  – One term, e.g., Plato
  – Boolean expression, e.g., Plato AND Brown

System supplies top n terms, in ranked order of frequency of co-occurrence with the initial term
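
A minimal sketch of single-term seeding follows, assuming records are in-memory sets of terms. The names `SeedRanker` and `topRelated` are hypothetical, Boolean seed expressions are not handled, and the alphabetical tie-break is an arbitrary choice made here so the output is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not the system's actual code.
final class SeedRanker {

    // Return up to n terms that share at least one record with the seed,
    // ordered by descending co-occurrence count, ties broken alphabetically.
    static List<String> topRelated(String seed, int n, List<Set<String>> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> record : records) {
            if (!record.contains(seed)) {
                continue; // only records that contain the seed contribute
            }
            for (String term : record) {
                if (!term.equals(seed)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> ranked = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            ranked.add(entries.get(i).getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<Set<String>> articles = List.of(
                Set.of("Plato", "Aristotle", "Smith"),
                Set.of("Plato", "Smith"),
                Set.of("Plato", "Aristotle", "Smith", "Brown"));
        System.out.println(topRelated("Plato", 2, articles));
    }
}
```

On the three-article data set used earlier in this deck, Smith co-occurs with Plato three times, Aristotle twice, and Brown once, so asking for the top two terms returns Smith followed by Aristotle.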

Example

For Plato seed:

ARISTOTLE, PLUTARCH, CICERO, HOMER, BIBLE, EURIPIDES, ARISTOPHANES, XENOPHON, AUGUSTINE, HERODOTUS, KANT-I, AESCHYLUS

SOPHOCLES, THUCYDIDES, OVID, HESIOD, DIOGENES-LAERTI, HEIDEGGER-M, DERRIDA-J, PINDAR, NIETZSCHE-F, HEGEL-GWF, VERGIL, AQUINAS-T

Need for Visualization

Given a list of user- / system-supplied terms
  – Find the frequency of co-occurrence of each pair-wise combination of terms
    • Plato AND Aristotle = 1,920
    • Plato AND Plutarch = 380
    • …
  – Too many numbers to take in at once
    • C(25, 2) = (25 * 24) / 2 = 300 pairs

Three major visualization techniques
  – Multidimensional Scaling (MDS)
  – Self-Organizing (Kohonen) Maps (SOMs)
  – PathFinder Networks (PFNETs)
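
To give a feel for how one of these techniques reduces a full table of pair counts to something viewable, here is a rough sketch of Pathfinder link pruning. It assumes the commonly used PFNET(r = ∞, q = n − 1) variant, which the slides do not specify, and it converts counts to distances with an arbitrary decreasing transform. The class name and the three-term count matrix are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; parameters r = infinity, q = n - 1 are an assumption.
final class PathfinderSketch {

    // Returns index pairs {i, j} with i < j for the links that survive pruning.
    // A link survives when no indirect path has a smaller "largest step".
    static List<int[]> pfnet(int[][] counts) {
        int n = counts.length;
        int max = 0;
        for (int[] row : counts) {
            for (int c : row) {
                max = Math.max(max, c);
            }
        }
        double[][] direct = new double[n][n];
        double[][] minimax = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double d;
                if (i == j) {
                    d = 0;
                } else if (counts[i][j] > 0) {
                    d = max + 1 - counts[i][j]; // higher count = shorter distance
                } else {
                    d = Double.POSITIVE_INFINITY; // never co-occur: no link
                }
                direct[i][j] = d;
                minimax[i][j] = d;
            }
        }
        // Floyd-Warshall variant: path length is the largest step on the path.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    double viaK = Math.max(minimax[i][k], minimax[k][j]);
                    minimax[i][j] = Math.min(minimax[i][j], viaK);
                }
            }
        }
        List<int[]> links = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (!Double.isInfinite(direct[i][j]) && direct[i][j] <= minimax[i][j]) {
                    links.add(new int[] {i, j});
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up counts for terms A, B, C: A-B = 10, B-C = 8, A-C = 2.
        int[][] counts = {{0, 10, 2}, {10, 0, 8}, {2, 8, 0}};
        for (int[] link : pfnet(counts)) {
            System.out.println(link[0] + " - " + link[1]);
        }
    }
}
```

In the made-up example the weak A-C link is dropped because the route through B never takes a step as long as the direct link, leaving A-B and B-C. With r = ∞ only the rank order of the distances matters, so any strictly decreasing transform of the counts gives the same network.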

White’s MDS map of 15 co-cited classificationists, ca. 1990

Authors shown on the map: RR Sokal, PHA Sneath, JC Gower, JH Ward, JD Carroll, JB Kruskal, VE McGee, RN Shepard, JA Hartigan, HA Skinner, SC Johnson, M Wish, P Arabie, RK Blashfield, PE Green

White’s PFNet of co-cited authors in Biblical and literary hermeneutics, 1988-1997

Authors shown in the network: SCHLEIERMACHER F, GADAMER HG, KANT I, HEGEL GWF, BARTH K, DILTHEY W, HEIDEGGER M, PLATO, BIBLE, ARISTOTLE, HABERMAS J, DERRIDA J, RICOEUR P, GOETHE JWV, BULTMANN R, FRANK M, NIETZSCHE F, TILLICH P, FICHTE JG, PANNENBERG W, TROELTSCH E, SCHELLING FWJ, SCHLEGEL FV, LUTHER M, EBELING G

Our System

Three tiered
  – User interface
  – Server
  – Database

Real-time and interactive

Significant data sources
  – ISI AHCI
  – MedLine

Live interface for retrieval

Components shown in the architecture diagram: BRS Search Engine, Web Server, Java Servlets, Web-based Map Interface, Java Applet, Mapping Procedures, Application Server, Oracle Databases, PUBMED Search Engine

User Interface - Seed

User Interface – SOM

Interface - PFNET

Interface - Visual Information Retrieval Interface (VIRI)

User Interface IV

Database Interface

API
  – String[] findRel(String, int)
  – int[] findOcc(String[])

Implemented on:
  – BRS
    • API via a wrapper
  – Oracle
    • API via JDBC
  – Noah
    • Specialized co-occurrence database
    • API via JNI
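
The two signatures suggest a small back-end-neutral interface that the BRS, Oracle, and Noah implementations could share. The sketch below is a guess at how that might look. The interface name, the parameter names, and in particular the assumption that findOcc returns one count per term pair in the order (0,1), (0,2), …, (n−2,n−1) are not stated in the slides.

```java
// Hypothetical interface name; only the two method signatures come from the slide.
interface CoOccurrenceDatabase {

    // Assumed meaning: the top n terms related to the seed term.
    String[] findRel(String seed, int n);

    // Assumed meaning: one co-occurrence count per pair of the given terms.
    int[] findOcc(String[] terms);
}

// Hypothetical helper that prepares findOcc output for a mapping procedure.
final class CountMatrix {

    // Expand flat pair counts, assumed to be ordered (0,1), (0,2), ..., (n-2,n-1),
    // into a symmetric n x n matrix with a zero diagonal.
    static int[][] fromPairCounts(int n, int[] pairCounts) {
        if (pairCounts.length != n * (n - 1) / 2) {
            throw new IllegalArgumentException("expected " + (n * (n - 1) / 2) + " counts");
        }
        int[][] matrix = new int[n][n];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                matrix[i][j] = pairCounts[k];
                matrix[j][i] = pairCounts[k];
                k++;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Counts for Plato, Aristotle, Smith, Brown from the earlier example,
        // arranged in the assumed pair order.
        int[][] m = fromPairCounts(4, new int[] {2, 3, 1, 2, 1, 1});
        System.out.println(java.util.Arrays.deepToString(m));
    }
}
```

Keeping the interface this narrow would let the mapping procedures work unchanged whichever of the three back ends supplies the counts.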

Future Plans

User Study
  – Preference
    • Type of map, etc.
  – Cognitive map
    • How well does the map match experts’ mental models

Larger datasets

Additional data sources
