Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
meow::06
David Newman
Bill Landis, ex officio
Kat Hagedorn
Clustering, Classification, and Metadata Enhancement Techniques
July 24, 2006
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
I. Preprocessing and Topic Modeling
II. The “Browser”
III. Lessons Learned and Next Steps
Goals
• Evaluate topical/subject-based metadata enhancement
• Experiment on testbed of multiple OAI repositories
• Discuss lessons learned and refine testing
• Propose products and services
What We Did
Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics
What We Did
Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics
Classify: OAI records → preprocess (with vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
What We Did
Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics
Classify: OAI records → preprocess (with vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
• Clustering is learning the topics
• Classification is using the learned topics
Repository Selection
• Mix of cultural heritage repositories?
– UMich, Library of Congress, CDL, State Lib of Victoria (Aust), …
– Average of 15 words per record (excl. stopwords)
– Topics often specific to collection (e.g., State Lib of Victoria)
– Experience with CDL’s American West project
• Mix of scientific/research repositories?
– CiteSeer, arXiv, PubMed, …
– <description> is a reasonably reliable 200-word abstract
– Average of 75 words per record
– Topics more likely to span repositories
• For purposes of evaluation, used (mostly) English-language repositories
Selected Repositories*
Short name | Description | Records | Records used for clustering (learning)
arxiv | arXiv.org Eprint Archive | 368,000 | 1 in 3
caltech | Caltech Electronic Theses and Dissertations | 3,000 | -
cern | CERN Document Server | 45,000 | 1 in 2
citeseer | CiteSeer Scientific Literature Digital Library | 717,000 | 1 in 3
doaj | Directory of Open Access Journals Articles | 29,000 | 1 in 2
iop | Institute of Physics | 212,000 | 1 in 3
loc | Library of Congress Digitized Historical Collections | 239,000 | -
nsdl | The National Science Digital Library | 33,000 | 1 in 2
osti | Office of Science and Technology Information | 131,000 | 1 in 3
pangaea | Publishing Network for Geoscientific and Environmental Data | 370,000 | -
pubmed | PubMed Central | 625,000 | 1 in 3
repec | Research Papers in Economics | 141,000 | 1 in 3
*Repositories harvested by UMich/OAIster, June 7, 2006.
Usage of Dublin Core Fields
• Decided to use words from <title>, <description>, <subject> for clustering
• Idiosyncrasies
– CiteSeer: repeats <author> and <title> in <subject>
– CiteSeer: puts citations to other IDs in <description>
– arXiv: puts e.g., “Comment: 12 pages PostScript” in <description>
– RePEc: no <subject>, repeats ID in <description>
– etc.
• Approach: Process all repositories identically, no special treatment
Preprocessing Example
<ID=oai:CiteSeerPSU:44072>
<title>Reinforcement Learning: A Survey
<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey
↓ preprocess (with vocabulary)
<ID=oai:CiteSeerPSU:44072>
reinforcement learning survey
survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …
leslie pack kaelbling littman andrew moore reinforcement learning survey
Stopwords and Stemming
• Standard: and, the, …
• Research related: research, paper, data, system, method, result, …
• Repository specific: cern, citeseer, repec, Smith, …
• All tokens starting with a digit: 1996, 401k, …
• Produced stopword list of 500 words
• Applied very simple stemming (cars → car)
• Note: replacing collocations improves interpretability of topics, but not quality (los angeles → los_angeles)
• Don’t need to find and exclude all stopwords because topic model will help find these (e.g. des, les, une, …) -- suppress after the fact
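The rules above can be sketched in a few lines of Python. This is a minimal sketch, not the preprocessor used in the project: the stopword set is a tiny stand-in for the 500-word list, the function names are hypothetical, and the "very simple stemming" is assumed here to mean stripping one trailing plural "s".

```python
import re

# Tiny illustrative stopword set; the real list had about 500 words.
STOPWORDS = {"and", "the", "of", "a", "is", "to", "from", "this", "it",
             "research", "paper", "data", "system", "method", "result",
             "cern", "citeseer", "repec"}


def simple_stem(token):
    """Assumed stemming rule: strip one trailing plural 's' (cars -> car)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop digit-initial tokens and stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for token in tokens:
        if token[0].isdigit():  # 1996, 401k, ...
            continue
        stem = simple_stem(token)
        # Check both forms so that "results" is caught by the stopword "result".
        if token in STOPWORDS or stem in STOPWORDS:
            continue
        kept.append(stem)
    return kept


print(preprocess("The cars and the 401k plans of 1996"))
```

Collocation replacement (los angeles → los_angeles) is left out of the sketch; it would be one more substitution pass before tokenizing.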
Building Vocabulary
• Preprocessed (sampled) repositories, excluded stopwords
• Only kept words that occurred in more than 10 records
• Result: a final vocabulary with ~ 90,000 words
• Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, …
• Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through existing vocabulary)
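A minimal sketch of the vocabulary step, assuming records have already been preprocessed into token lists. The function names and the toy corpus are hypothetical; only the "more than 10 records" rule and the idea of filtering new documents through the existing vocabulary come from the slide.

```python
from collections import Counter


def build_vocabulary(records, min_records=10):
    """Keep words that occur in more than `min_records` records."""
    doc_freq = Counter()
    for tokens in records:
        doc_freq.update(set(tokens))  # count each word once per record
    return {word for word, n in doc_freq.items() if n > min_records}


def filter_through_vocabulary(tokens, vocabulary):
    """Drop out-of-vocabulary words, as happens to new records when classifying."""
    return [t for t in tokens if t in vocabulary]


# Toy corpus: 'cell' is in 12 records, 'protein' in 11, 'zebrafish' in 2.
records = [["cell", "protein"]] * 11 + [["cell", "zebrafish"]] + [["zebrafish"]]
vocab = build_vocabulary(records)
print(sorted(vocab))
print(filter_through_vocabulary(["zebrafish", "cell", "cell"], vocab))
```

The second function shows why the discussion point matters: a word that never made it into the vocabulary is invisible to the classifier until the vocabulary is rebuilt.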
• Average of 75 words per record
• Bimodal because used records with abstracts and records without abstracts
• Topic model isn’t adversely affected by very short records
Computation
• Clustering (Learning)
D = 750,000 records
W = 90,000 word vocabulary
L = 75 words per record
T = 500 topics
iter = 500 iterations
memory = 3DL + T(D+W) = 3 GByte
time = D L T iter = 3 days (3 GHz Xeon)
• Classification
D = 3,000,000 records total
iter = 40 iterations
max memory = 2 GByte
max time = 5 hours (but repositories can run in parallel)
Decision point: How many topics?
Decision point: How many iterations?
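The clustering figures can be checked with a few lines of arithmetic. The sketch below plugs the slide's values into the two formulas; the 4 bytes per stored entry is an assumption (the slide does not state an entry size), so the result should be read as an order-of-magnitude check only.

```python
D = 750_000   # records used for clustering
W = 90_000    # vocabulary size
L = 75        # average words per record
T = 500       # topics
ITER = 500    # iterations

BYTES_PER_ENTRY = 4  # assumption: 4-byte integers; not stated on the slide

entries = 3 * D * L + T * (D + W)      # memory formula from the slide
memory_gb = entries * BYTES_PER_ENTRY / 1e9
updates = D * L * T * ITER             # time formula from the slide

print(f"stored entries: {entries:,}")
print(f"memory at {BYTES_PER_ENTRY} bytes/entry: {memory_gb:.2f} GB")
print(f"D*L*T*iter: {updates:.2e}")
```

Under that assumption the memory comes to about 2.4 GB, the same order as the 3 GByte quoted. Both formulas grow linearly in T, which is one reason the number of topics is a decision point.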
Broad Topical Categories
• 500 topics too many to look at
• Need to organize topics under broad topical categories
– Cluster the clusters (automatic)
– Use pre-defined categories
  • Classify group of keywords (manual + automatic)
  • Create hierarchy by hand (manual)
Broad Topical Categories
Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics
Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
Broad Topical Categories
Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics
Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
Classify group of keywords: group of keywords → preprocess (with vocabulary) → topic model (classify) → topics organized under broad topical categories
II. The “Browser”
The “Browser”
• PHP/MySQL browser of 3 million OAI records*
• Preserving transparency for this audience
• Browser not meant for end users
• No search, no information architecture, etc.
• http://yarra.calit2.uci.edu/meow/
*Based on 750,000 sampled records from 9 repositories, 500 topics
The “Browser”: http://yarra.calit2.uci.edu/meow/
Selected Topics: Useful
• [ t201 ] learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised
• [ t482 ] labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation
• [ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
• [ t097 ] dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational
• [ t027 ] hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120
• [ t365 ] waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level
> show all 500 sub-topics (to see all 500 topics)
Selected Topics: Less Useful
• [ t255 ] journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic
• [ t328 ] paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian
• [ t384 ] supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency
• [ t112 ] look people difficult thing need want fact reason help understand think say alway try easy bad
• [ t496 ] increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline
• [ t012 ] des les dan une est par sur pour qui nous sont aux ces analyse pay cette
But junk topics alleviate the need to exhaustively find stopwords: many useless words cluster into topics, which can then be suppressed. They are also very useful for filtering out French records.
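One way to act on this, sketched under assumptions: keep a hand-made list of suppressed topic ids, hide those topics when displaying a record, and flag records dominated by the French-stopword topic. The topic ids are the ones listed above; the per-record weights, the function names, and the 0.5 threshold are hypothetical.

```python
# Hypothetical hand-made suppression list, using the topic ids listed above.
SUPPRESSED = {"t255", "t328", "t384", "t112", "t496", "t012"}
FRENCH_TOPIC = "t012"


def visible_topics(topic_weights):
    """Drop suppressed topics and renormalize the remaining weights."""
    kept = {t: w for t, w in topic_weights.items() if t not in SUPPRESSED}
    total = sum(kept.values())
    return {t: w / total for t, w in kept.items()} if total else {}


def looks_french(topic_weights, threshold=0.5):
    """Flag a record when the French-stopword topic dominates (threshold is an assumption)."""
    return topic_weights.get(FRENCH_TOPIC, 0.0) >= threshold


record = {"t201": 0.6, "t384": 0.2, "t255": 0.2}
print(visible_topics(record))
print(looks_french({"t012": 0.8, "t482": 0.2}))
```

Suppression here happens after classification, so the topic model itself does not need to be re-run when the list changes.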
Broad Topical Categories (BTCs)
• By clustering the clusters
– worked well
– mathematics, global energy resources, …
– can choose desired number of broad topical categories (e.g., 25) and thresholding
• By classifying groups of keywords
– worked well too
• Then review and manually edit
– include or exclude any subtopic
BTCs: Clustering the clusters
BTCs: Classifying group of keywords
>>> Aerospace Engineering
stars (15) space (18) aeronautics (20) astronautics (20) rocket (12) shuttle (12) exploration (15) lander (3) planets (7) black holes (7) quasars (7) pulsars (7) observatories (10) air traffic (10) aircraft (15) aerospace (20) airplanes (10) airports (10) heliports (10) helicopters (10) aviation (18) FAA (7) airlines (12) flight (18) comets (10) meteorites (12) spacecraft (15) air force (7) pilots (7) jets (7) air travel (15) flying (18)
domain expert specifies list of relevant keywords and (importance)
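A keyword list in the "keyword (importance)" form shown above is easy to turn into a weighted dictionary before it is handed to the classifier. The sketch below covers only that parsing step; the function name is hypothetical, and how the topic model consumes the weights is not described on the slide.

```python
import re


def parse_keywords(spec):
    """Parse 'keyword (weight)' pairs; keywords may contain spaces."""
    pairs = re.findall(r"([A-Za-z][A-Za-z ]*?)\s*\((\d+)\)", spec)
    return {keyword.strip().lower(): int(weight) for keyword, weight in pairs}


spec = "stars (15) space (18) black holes (7) air traffic (10) FAA (7)"
weights = parse_keywords(spec)
print(weights)
```

Multi-word keywords such as "black holes" stay together as one entry, which matters if collocations are later joined into single tokens.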
BTCs: Classifying group of keywords
>>> Aerospace Engineering
[t192] (69%) vehicle flight vehicles engine car road speed nasa aircraft air
[t352] (13%) star solar planet mass astrophysic binary dwarf orbital sun companion
[t191] (8%) space spaces hilbert subspace dimensional subspaces defined exploration linear point
>>> Dermatology
[t388] (83%) infection skin disease tract respiratory fever burgdorferi caused wound arthritis
[t157] (8%) cancer tumor p53 breast carcinoma survival human tumour malignant prostate
[t071] (7%) growth tuberculosis mycobacterium growing grow igf factor bcg avium
>>> Geology and Earth Sciences
[t121] (73%) geothermal rock seismic energy mountain drilling fluid survey spring yucca
[t268] (12%) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol
>>> Molecular, Cellular and Developmental Biology
[t276] (31%) molecular biological sciences molecules biology molecule quantitative biochemistry basic
[t417] (15%) cell apoptosis cellular death cultured bcl lines hela transfected mediated
[t355] (12%) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine
[t418] (9%) genes genome gene repeat chromosome sequences dna genomic sequence region
[t319] (7%) mice development mouse drosophila expression transgenic cell embryonic embryos gene
>>> Transportation
[t192] (85%) vehicle flight vehicles engine car road speed nasa aircraft air
in review, would delete this topic from this BTC
just found 1 topic relevant to transportation
Browse Records in a Topic
nice mix of repositories
can navigate back to multiple BTCs
Browse Records in a Topic: From one repository
display records just from Library of Congress
Sample Record
Murphy's Law in algebraic geometry: Badly-behaved deformation spaces
> preprocessed text
murphy law algebraic geometry badly behaved deformation spaces
consider question bad deformation space object answer priori reason deformation space bad moduli spaces precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating tractable deformation spaces smooth morphism essential starting point mnev universality theorem mathematic algebraic geometry mathematic complex variables
> top topics
[ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
[ t191 ] space spaces
(topics for this record)
oai:arXiv.org:math/0411469
(link to actual OAI record)
Repository-specific Browsers
• Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
• University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
• University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
• African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
• and many more…
III. Lessons Learned and Next Steps
Evaluation
• Topic modeling worked well
– Most topics were useful
– Drain on computer resources was reasonable
– Human effort was relatively small
– All repositories processed identically, no special treatment
• Strategy worked well
– Clustering, then
– Classification, and
– Broad Topical Categories creation
Further Evaluation
• Current processing only for
– English-language repositories
– Science/research based repositories
• Need to test cultural heritage repositories and foreign-language records
– Less consistent descriptive language and length
– “On-the-horse” problem more prevalent
– Greater need to individually process repositories
• Also need usability testing to evaluate further
– Depends on criteria -- who are users?
  • Librarians?
  • End-users?
– Depends on products and services desired by users
Discussion Point: When to Re-cluster?
• Need to re-cluster
– when collection changes significantly
– if there is a “hole” in topics
– but NOT just because you have another repository
• If you re-cluster
– all topics will be different
– have to discard hand-labeling
– Broad Topical Categories might be different
• However, classification is
– “cheap” and easy
– e.g., for OAIster, could re-classify every harvest…until spring clean
[Figure: timeline of “cluster” and “classify” runs]
Products and Services
• Depending on users…
• What kind of service is useful?
• What should interface to topics look/act like?
• What kind of use should we envision?
• We have some ideas…
Archive of Topics
• Are the topics we created useful to anyone else?
• Scenario: librarian uses topics/classifier for local resources
• To use locally you need:
– the preprocessor (i.e. the preprocessing rules)
– the vocabulary (file of 90,000 words)
– the topic model classifier
Subject Search/Browse for OAIster
• Integrate topics into OAIster
– add to records so can perform canned topic search
– add to interface so can browse BTCs to records
• Additionally, can allow users to find records similar to those retrieved
– e.g., retrieved records on cosmology and can find similar records on astrophysics, relativity, …
• How to do this?
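One possible approach to the similar-records idea, offered as a sketch and not as the method chosen for OAIster: treat each record's topic mix as a sparse vector and rank other records by cosine similarity. The record names and weights below are hypothetical; the topic ids reuse t097 (cosmology), t352 (stars and planets) and t482 (labor) from the browser examples.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse topic vectors (topic id -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_id, records, k=2):
    """Rank the other records by topic-vector similarity to the query record."""
    query = records[query_id]
    scored = [(cosine(query, vec), rid)
              for rid, vec in records.items() if rid != query_id]
    return [rid for _, rid in sorted(scored, reverse=True)[:k]]


# Hypothetical records and topic mixes.
records = {
    "cosmology-paper": {"t097": 0.9, "t352": 0.1},
    "astrophysics-paper": {"t097": 0.4, "t352": 0.6},
    "labor-paper": {"t482": 1.0},
}
print(most_similar("cosmology-paper", records))
```

Because records are compared through shared topics and not shared words, two records can be ranked as similar even when their vocabularies barely overlap.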
How To Reach Us
• David Newman: University of California, Irvine
• Kat Hagedorn: University of Michigan
• Bill Landis: California Digital Library