H-Net and Scholarly Discourse in the Digital Age: A New Approach to Data Mining Email Discussion Lists William Punch Mark Lawrence Kornbluh Wayne Dyksen Michigan State University November 5, 2006

H-Net and Scholarly Discourse in the Digital Age:

A New Approach to Data Mining Email Discussion Lists

William Punch

Mark Lawrence Kornbluh

Wayne Dyksen

Michigan State University

November 5, 2006

H-Net and Scholarly Discourse in the Digital Age

• Opportunity & Challenge: Searching Large Text Archives

• New Approach: Semantic-Augmented Consensus Clustering

• Application: H-Net Discussion Lists

IT Communication Revolution

• People: Few → Many, Experts → Everyone

• Speed: Slow → Instant

• Quantity: Small → Vast

• Style: Long → Short

• Location: Limited → Everywhere

• Lifetime: Short → Forever

Impact on Scholarly Communication

New…

• Forms of Interactivity

• Trans-Disciplinary Communities

• Participants

– Producers

– Consumers

• Levels of Democratization of Information

Electronic Archives

• Mostly Text Based

• Exponential Growth

• Not Catalogued or Catalog-able

• Little or No Metadata

• Untapped Value

– Current Users

– Future Scholars

Opportunity & Challenge

Large Text Archive → Information → Knowledge

Typical Document Search

Data

• Words and Phrases

• Boolean Combinations

• Automatic (“Unsupervised”)

• Not Sufficient

– Too Little

– Too Much

Metadata

• Keywords & Annotations

• Classifications

• By Hand (“Supervised”)

• Not Scalable

– 1M Messages

– 3GB Text

Our Research

• On Large Text Archives

– Organization

– Exploration

• Develop and Test

– New Techniques

– New Tools

• Interdisciplinary

Large Text Archive → Information → Knowledge

H-Net and Scholarly Discourse in the Digital Age

Opportunity & Challenge: Searching Large Text Archives

• New Approach: Semantic-Augmented Consensus Clustering

• Application: H-Net Discussion Lists

Two Approaches

Very broadly, there are two approaches we could use to help a user find documents in a large set:

• Classification

• Clustering

The Two Spiral Problem

Our little example: how do we discriminate between the two intertwined spirals?

[Figure: the intertwined pattern shown as the sum of its two component spirals]

Classification

Given k classes, find the best class in which to place a particular example.

Typically two stages:

• Train the algorithm on examples from the k classes

• See how well the algorithm does on placing an unknown into the correct class
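The two stages above can be sketched on the two-spiral example. This is a minimal illustration, not the presenters' method: the k-nearest-neighbour classifier, the spiral generator, and every parameter value (`turns`, `noise`, `k`, `test_fraction`) are assumptions chosen only to make the train-then-test workflow concrete.

```python
import math
import random

def make_spirals(n_per_class=200, turns=1.5, noise=0.02, seed=0):
    """Generate two intertwined spirals; returns (points, labels)."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_class):
        r = 0.1 + 0.9 * i / n_per_class            # radius grows along the arm
        angle = turns * 2 * math.pi * r
        # Class 1 is class 0 rotated by 180 degrees.
        for label, phase in ((0, 0.0), (1, math.pi)):
            x = r * math.cos(angle + phase) + rng.gauss(0, noise)
            y = r * math.sin(angle + phase) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels

def knn_classify(train_points, train_labels, query, k=3):
    """Place `query` in the majority class of its k nearest training examples."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], query))[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

def train_test_accuracy(points, labels, test_fraction=0.2, seed=1):
    """Stage 1: keep labelled training examples. Stage 2: score held-out unknowns."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_fraction)
    test, train = order[:n_test], order[n_test:]
    train_points = [points[i] for i in train]
    train_labels = [labels[i] for i in train]
    correct = sum(knn_classify(train_points, train_labels, points[i]) == labels[i]
                  for i in test)
    return correct / n_test

if __name__ == "__main__":
    pts, lbls = make_spirals()
    print(f"held-out accuracy: {train_test_accuracy(pts, lbls):.2f}")
```

Because the classifier is told which spiral each training point belongs to, it can follow the arms even though they wrap around each other.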

Classification Example

[Figure: examples from Class 1 and Class 2 are used to train the algorithm; the trained algorithm is then tested on an unknown example (“?”)]

Supervised

Classification is a supervised process. We know the k classes (or have a good idea of them), so we train the algorithm until it works properly on labeled examples, then test how well it learned by giving it unknowns.

Clustering

Slightly different. Given a set of examples, find the “best” partitioning into k sets of those examples.

Also two stages:

1. Cluster the examples; we provide k.

2. Measure, in some way, how well separated the examples are.
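A minimal sketch of these two stages, assuming k-means as the clustering algorithm and a simple ratio as the separation measure. Neither choice comes from the slides; `make_blobs`, `kmeans`, and `separation_ratio` are hypothetical names, and the data are two synthetic, well-separated blobs rather than the spirals.

```python
import math
import random

def make_blobs(n_per_blob=30, centers=((0.0, 0.0), (10.0, 10.0)), spread=0.5, seed=0):
    """Synthetic data: n_per_blob Gaussian points around each centre, in centre order."""
    rng = random.Random(seed)
    return [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
            for cx, cy in centers for _ in range(n_per_blob)]

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def kmeans(points, k, n_iter=20):
    """Stage 1: partition the points into k clusters (we provide k)."""
    # Farthest-first initialisation: deterministic, spreads the starting centroids out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                      # keep the old centroid if a cluster empties
                centroids[c] = centroid(members)
    return labels, centroids

def separation_ratio(points, labels, centroids):
    """Stage 2: smallest centroid-to-centroid distance divided by the
    mean distance from each point to its own centroid (larger is better)."""
    within = sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)
    between = min(math.dist(a, b)
                  for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return between / within

if __name__ == "__main__":
    pts = make_blobs()
    labels, cents = kmeans(pts, k=2)
    print(f"separation ratio: {separation_ratio(pts, labels, cents):.1f}")
```

A distance-to-centroid notion of closeness like this one suits compact blobs; it would not follow the arms of the two spirals, which is one reason the choice of closeness measure matters.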

Example

Algorithm

Unsupervised

There is typically no training in clustering. We choose where to put a point based on some criterion of “closeness”.

As you can see, that can be hard to measure.

Document Clustering

Our approach is to cluster documents (instead of points in a spiral), grouping together documents that are “close” to each other in meaning.

The result should be sets of documents that have something in common, especially if the process is user influenced.

Three general problems we will address

• Consensus clustering

• Semantic distance measure

• Semi-supervised user influence on the clustering process

One: Consensus Clustering

Two basic problems:

1. Often, no single measure of “closeness” is sufficient to get good clusters. The result should be a combination of many such measures.

2. On large document sets, any algorithm is likely to be expensive. Run on smaller subsets of the overall set, however, it is much cheaper.

Example

Simplest clustering algorithm ever invented! Draw a random line through the cluster space. One side is cluster 1, the other side is cluster 2.

And the results…

Um, so why?

1. The algorithm is cheap, very cheap! Draw a line through the “space”. Cheap is good when you are worried about large numbers.

2. It turns out that multiple applications, each poor, when taken together in consensus give very good results!

3. Multiple “measures” can be accounted for this way.
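To make the consensus idea concrete, here is a minimal sketch under stated assumptions. Each weak clustering is one random line through the data. The way the votes are combined (a co-association matrix, then connected components above a threshold) is one common way to form a consensus and is assumed here, not taken from the slides. The function names, the 0.5 threshold, and the synthetic two-blob data are all hypothetical.

```python
import math
import random

def make_two_blobs(n_per_blob=20, seed=0):
    """Synthetic data: two tight, well-separated squares of points, blob A first."""
    rng = random.Random(seed)
    return [(cx + rng.uniform(-0.25, 0.25), cy + rng.uniform(-0.25, 0.25))
            for cx, cy in ((0.0, 0.0), (10.0, 10.0)) for _ in range(n_per_blob)]

def random_line_split(points, rng):
    """One weak clustering: label each point 0 or 1 by its side of a random line."""
    angle = rng.uniform(0.0, math.pi)
    ux, uy = math.cos(angle), math.sin(angle)       # unit normal of the line
    proj = [x * ux + y * uy for x, y in points]
    offset = rng.uniform(min(proj), max(proj))      # where the line cuts the data
    return [1 if p > offset else 0 for p in proj]

def co_association(points, n_splits=300, seed=0):
    """Fraction of random splits in which each pair of points lands on the same side."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_splits):
        side = random_line_split(points, rng)
        for i in range(n):
            for j in range(n):
                if side[i] == side[j]:
                    together[i][j] += 1
    return [[count / n_splits for count in row] for row in together]

def consensus_labels(coassoc, threshold=0.5):
    """Consensus: connected components of the graph that links every pair
    whose co-association exceeds the threshold."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

if __name__ == "__main__":
    pts = make_two_blobs()
    print(consensus_labels(co_association(pts)))
```

Any single split is a poor clustering, but points in the same blob end up on the same side far more often than points in different blobs, and that difference is what the consensus step picks up. Additional closeness measures could contribute further votes to the same matrix.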

Two: Semantic Distance

One distance measure we would like to add to the consensus is semantic distance. How close semantically are two documents?

How to do this cheaply?

WordNet

• Started by George Miller at Princeton (“The Magical Number Seven, Plus or Minus Two”) in 1985. Funded to study machine translation.

• It is much more than just a dictionary. It is an ontology (in CS, that means a data model) of English.

• It includes relationships such as: hypernym, hyponym, meronym, holonym, synonym, antonym, etc.

Use WordNet to find semantic distance

How close are “dog” and “cat”?

dog:

– sense 1: domestic dog

– sense 2: unattractive girl

– sense 3: lucky man

– sense 4: a cad

– sense 5: hot dog

– sense 6: hinged catch

– sense 7: andiron

↑ hypernym

canine:

– sense 1: tooth

– sense 2: family Canidae

↑ hypernym

carnivore:

– sense 1: meat eater

↓ hyponym

feline:

– sense 1: felid

↓ hyponym

cat:

– sense 1: true cat

– sense 2: guy

– sense 3: spiteful woman

– sense 4: tea

– sense 5: whip

– sense 6: truck

– sense 7: lions

– sense 8: tomography
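The chain above (dog, canine, carnivore, feline, cat) suggests counting links as a distance. The sketch below does this on a tiny hand-written taxonomy. It is a toy under stated assumptions: the `HYPERNYM` table is hypothetical and is not WordNet data (the “wolf” entry is added only for illustration), each word has a single sense, and link counting is just one of several possible semantic distance measures.

```python
# Hypothetical toy taxonomy: each word maps to its hypernym (the more general term).
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
}

def hypernym_chain(word):
    """Return [word, its hypernym, that word's hypernym, ...] up to the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def path_distance(a, b):
    """Number of hypernym/hyponym links on the path from a to b, or None if unrelated."""
    steps_up_from_a = {w: steps for steps, w in enumerate(hypernym_chain(a))}
    for steps_b, w in enumerate(hypernym_chain(b)):
        if w in steps_up_from_a:            # first shared ancestor is the lowest one
            return steps_up_from_a[w] + steps_b
    return None

if __name__ == "__main__":
    print(path_distance("dog", "cat"))      # 4: dog, canine, carnivore, feline, cat
    print(path_distance("dog", "wolf"))     # 2: dog, canine, wolf
```

With real data each word has several senses, as the lists above show, so choosing which senses to connect is part of the problem and is not addressed by this sketch.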

Semantic Relationship Graphs

• Ultimately, we will find graphs of “close word senses” and use them to represent a document

The Text

Another problem was to make governments strong enough to prevent internal disorder. In pursuit of this goal, however, rulers were frustrated by one of the strongest movements of the eleventh and twelfth centuries: the drive to reform the Church. No government could operate without the participation of the clergy; members of the clergy were better educated, more competent as administrators, and usually more reliable than laymen. Understandably, the kings and the greater feudal lords wanted to control the appointment of bishops and abbots in order to create a corps of capable and loyal public servants. But the reformers wanted a church that was completely independent of secular power, a church that would instruct and admonish rulers rather than serve them. The resulting struggle lasted half a century, from 1075 to 1122. [6a]

Three: User Interaction

We want the user to be able to interact with the clustering process in a natural way (that is, not modify the algorithm).

We do this by allowing the user to establish relationships between documents:

• must-link (these docs go together)

• must-not-link (separate these docs)
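One way to picture these two kinds of constraint is as pairs of document indices that any acceptable clustering has to respect. The sketch below is an assumption-laden illustration, not the presenters' implementation: it merges must-link pairs transitively with a union-find structure, checks that no must-not-link pair contradicts them, and tests whether a given clustering honours both lists. All function names are hypothetical.

```python
def must_link_groups(n_docs, must_link):
    """Merge documents joined (directly or transitively) by must-link pairs.
    Returns one group identifier per document."""
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps the trees shallow
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n_docs)]

def constraints_consistent(n_docs, must_link, must_not_link):
    """False if some must-not-link pair is forced together by the must-links."""
    groups = must_link_groups(n_docs, must_link)
    return all(groups[a] != groups[b] for a, b in must_not_link)

def satisfies(labels, must_link, must_not_link):
    """True if a clustering (one label per document) respects every constraint."""
    return (all(labels[a] == labels[b] for a, b in must_link)
            and all(labels[a] != labels[b] for a, b in must_not_link))

if __name__ == "__main__":
    must_link = [(0, 1), (1, 2)]        # docs 0, 1, 2 go together
    must_not_link = [(2, 3)]            # docs 2 and 3 must be separated
    print(constraints_consistent(4, must_link, must_not_link))   # True
    print(satisfies([0, 0, 0, 1], must_link, must_not_link))     # True
    print(satisfies([0, 0, 1, 1], must_link, must_not_link))     # False
```

A constrained clustering algorithm would use a check like `satisfies` while assigning documents, so a few user-supplied pairs can reshape the result without the user touching the algorithm itself.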

Changing the algorithm

By changing the way documents cluster together, the user in effect changes the algorithm (because the constraints he or she establishes must be respected across all the documents), but in a way the user can understand.

H-Net and Scholarly Discourse in the Digital Age

Opportunity & Challenge: Searching Large Text Archives

New Approach: Semantic-Augmented Consensus Clustering

• Application: H-Net Discussion Lists

H-Net

• Humanities and Social Sciences OnLine

• Pioneer, Peer-Edited Discussion Lists

• 160 Networks

• 600+ Editors

• 150,000 Participants

• Global

H-Net Archives

• Scholarly Value

– Current Users

– Future Scholars

• Scale

– 1,000,000+ Messages

– 3GB of Text

Current Search Capabilities

By

• Date

• Author

• Subject

• Words in Text

What’s missing?

• Multi-Thread

• Multi-List

• Cross-Temporal

• Etc…

Example in H-Net

• The movie Amistad was discussed across H-Net networks: History, Literature, Film, Teaching, Economics

• Different perspectives

• Over time

Value to H-Net

• Locate related content

– Across time

– Across scholarly communities

• Facilitate interdisciplinary scholarship and teaching

• Synthesize new knowledge in new forms

Unlocking the Potential of Scholarly Communication

• Email and Forums

– Popularity

– Limitations

• Adding depth and breadth while maintaining immediacy

Value of Humanities Technology Research

• Fundamental challenge in computer science

• Humanities research: new insights, new connections

• H-Net provides testbed/testers

• Truly interdisciplinary research

H-Net and Scholarly Discourse in the Digital Age

Contact Information:

MATRIX: Center for the Humane Arts, Letters, and Social Sciences On-Line

www.matrix.msu.edu