DESCRIPTION
In spite of its tremendous value, metadata is generally sparse and incomplete, thereby hampering the effectiveness of digital information services. Many of the existing mechanisms for the automated creation of metadata rely primarily on content analysis which can be costly and inefficient. The automatic metadata generation system proposed in this article leverages resource relationships generated from existing metadata as a medium for propagation from metadata-rich to metadata-poor resources. Because of its independence from content analysis, it can be applied to a wide variety of resource media types and is shown to be computationally inexpensive. The proposed method operates through two distinct phases. Occurrence and co-occurrence algorithms first generate an associative network of repository resources leveraging existing repository metadata. Second, using the associative network as a substrate, metadata associated with metadata-rich resources is propagated to metadata-poor resources by means of a discrete-form spreading activation algorithm. This article discusses the general framework for building associative networks, an algorithm for disseminating metadata through such networks, and the results of an experiment and validation of the proposed method using a standard bibliographic dataset.
Automatic Metadata Generation Using Associative-Networks
Marko A. Rodriguez
CCS-3 ‘Tech Talk’
December 7, 2005
http://www.soe.ucsc.edu/~okram
Resources and Metadata
• A resource is any digital object (e.g., manuscripts, images, video, audio).
• A resource’s metadata record is a list of attributes describing the resource.
[ EXAMPLE MANUSCRIPT METADATA ] Authors, Institutions, Keywords, Subject Categories, Citations, Year, Publishing Journal, Usage Data
Metadata Record
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <responseDate>2005-09-07T15:25:04Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv.org:cs/0412047" metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0412047</identifier>
        <datestamp>2004-12-14</datestamp>
        <setSpec>cs</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
          <dc:title>A Social Network for Societal-Scale Decision-Making Systems</dc:title>
          <dc:creator>Rodriguez, Marko</dc:creator>
          <dc:creator>Steinbock, Daniel</dc:creator>
          <dc:subject>Computers and Society</dc:subject>
          <dc:subject>Data Structures and Algorithms</dc:subject>
          <dc:subject>Human-Computer Interaction</dc:subject>
          <dc:subject>H.4.2</dc:subject>
          <dc:subject>J.7</dc:subject>
          <dc:subject>K.4.m</dc:subject>
          <dc:description>In societal-scale decision-making systems the collective is faced ...</dc:description>
          <dc:description>Comment: Dynamically Distributed Democracy algorithm</dc:description>
          <dc:date>2004-12-10</dc:date>
          <dc:type>text</dc:type>
          <dc:identifier>http://arxiv.org/abs/cs/0412047</dc:identifier>
          <dc:identifier>North American Association for Computational Social and Organizational Science Conference Proceedings 2004</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
Problem Statement
• Metadata is costly to generate by hand.
• Metadata is hard to extract from raw resources (e.g. audio, video).
• How can we automatically generate metadata for atrophied resource records?
General System Overview
• Generate resource relations with existing metadata in the repository.
– occurrence and/or co-occurrence networks
• Propagate metadata from metadata-rich resources to metadata-limited resources.
– encapsulate metadata in discrete particles and disseminate them over the generated associative network
HEP-TH 2003 Semantic Network
[Figure: semantic network diagram. Node labels: A1-A4, P1-P5, O1-O2, J1-J2, K1-K2, T1-T2. Edge labels: "Author of", "Organization of", "Published journal", "Published time", "Has keyword", "cites".]
Transforming the Semantic Network
Convert the multi-node network into a collection of manuscripts with their associated attributes (metadata record).
– resource: manuscript
– metadata record:
• Authors
• Citations
• Publication Date
• Keywords
• Organizations
• Journal
Occurrence/Co-Occurrence
• Citation: two manuscripts are connected if one manuscript cites the other.
• Co-Author: two manuscripts are connected if they share the same authors
• Co-Citation: two manuscripts are connected if they share the same citations
• Co-Keyword: two manuscripts are connected if they share the same keywords
• Co-Organization: two manuscripts are connected if they share the same organizations
• Co-Date: two manuscripts are connected if they share the same publication date
• Co-Journal: two manuscripts are connected if they share the same journal
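The definitions above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation behind the talk: the `records` data, the field names, and the choice of edge weight (number of shared values) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: manuscript id -> metadata fields (sets of values).
records = {
    "P1": {"authors": {"A1", "A2"}, "keywords": {"K1"}, "citations": {"P2"}},
    "P2": {"authors": {"A2"}, "keywords": {"K1", "K2"}, "citations": set()},
    "P3": {"authors": {"A3"}, "keywords": {"K2"}, "citations": {"P1"}},
}

def occurrence_network(records, field="citations"):
    """Connect two resources when one record refers directly to the other."""
    edges = defaultdict(float)
    for rid, meta in records.items():
        for target in meta.get(field, ()):
            if target in records and target != rid:
                edges[frozenset((rid, target))] += 1.0
    return dict(edges)

def cooccurrence_network(records, field):
    """Connect two resources when they share at least one value of `field`.

    The edge weight is the number of shared values (an assumed weighting).
    """
    edges = {}
    for a, b in combinations(sorted(records), 2):
        shared = records[a].get(field, set()) & records[b].get(field, set())
        if shared:
            edges[frozenset((a, b))] = float(len(shared))
    return edges

citation_net = occurrence_network(records)
coauthor_net = cooccurrence_network(records, "authors")
cokeyword_net = cooccurrence_network(records, "keywords")
```

Here P1 and P2 are linked in the co-author network because both list A2, while the citation network links P1 with P2 and P3 with P1 through direct references.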
Network Generation Running Times
• Occurrence: O(N)
– Each resource’s metadata record must be checked once and only once for a direct reference to another resource.
• Co-occurrence: O((N^2 - N) / 2)
– Each resource’s metadata record must be checked against every other resource’s (N^2), except itself (-N), once and only once (1/2).
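As a quick check on the co-occurrence count, the expression is the number of unordered pairs of resources, which grows quadratically in N. The value N = 1000 below is only an illustrative repository size, not a figure from the experiment.

```latex
\frac{N^2 - N}{2} = \frac{N(N-1)}{2} = \binom{N}{2},
\qquad
N = 1000 \;\Rightarrow\; \frac{1000 \cdot 999}{2} = 499{,}500 \text{ comparisons}
```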
[Figure: two diagrams, one with nodes A and B, one with nodes A, B, and C.]
Particle Propagation
• Every resource is given one particle, p_i. This particle contains all the metadata associated with its resource.
• A particle also has an energy value, e_i. The further the particle travels (edge steps), the more its energy value decays.
e_i(t+1) = e_i(t) * (1-\delta)
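Unrolling the decay rule gives a closed form for the energy after t edge steps. The initial energy e_i(0) = 1 and the decay value \delta = 0.5 in the worked line are assumptions chosen only to make the arithmetic easy to follow.

```latex
e_i(t) = e_i(0)\,(1-\delta)^{t},
\qquad
e_i(0) = 1,\ \delta = 0.5 \;\Rightarrow\; e_i(3) = (0.5)^{3} = 0.125
```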
Particle Propagation
• The particle takes an outgoing edge of its current node, chosen according to the probability distribution over that node’s outgoing edge set. If the resource it encounters lacks metadata of a particular type, the particle recommends its own metadata of that type to the resource, weighted by the particle’s current energy value.
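The propagation step can be sketched as a weighted random walk. This is a rough sketch under stated assumptions, not the talk's implementation: the function name `propagate`, the initial energy of 1.0, the fixed number of steps, and applying the decay before each recommendation are all hypothetical choices.

```python
import random
from collections import defaultdict

def propagate(graph, metadata, field, delta=0.15, steps=5, rng=None):
    """Spread `field` metadata over an associative network.

    graph:    node -> {neighbor: edge weight}
    metadata: node -> {field name: set of values}
    Returns node -> {value: accumulated recommendation strength} for
    nodes that lack `field`.
    """
    rng = rng or random.Random(0)
    recs = defaultdict(lambda: defaultdict(float))
    for source, meta in metadata.items():
        values = meta.get(field)
        if not values:
            continue  # this particle carries nothing of this type
        node, energy = source, 1.0  # assumed initial energy
        for _ in range(steps):
            nbrs = graph.get(node)
            if not nbrs:
                break  # dead end: the particle stops
            targets = list(nbrs)
            # Pick an outgoing edge in proportion to its weight.
            node = rng.choices(targets, weights=[nbrs[n] for n in targets])[0]
            energy *= 1.0 - delta  # e(t+1) = e(t) * (1 - delta)
            if not metadata.get(node, {}).get(field):
                for value in values:
                    recs[node][value] += energy
    return {n: dict(v) for n, v in recs.items()}

# Hypothetical two-node network: A has a journal, B does not.
graph = {"A": {"B": 1.0}, "B": {"A": 1.0}}
metadata = {"A": {"journal": {"J1"}}, "B": {}}
result = propagate(graph, metadata, "journal", delta=0.5, steps=3)
```

In this toy network the particle from A reaches B on steps 1 and 3 with energies 0.5 and 0.125, so B accumulates a strength of 0.625 for J1.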
Metadata Recommendations
• Manuscript A
– Journal
• Journal of Complexity [0.2457]
• Journal of Information Science [0.1]
• Information Processing and Management [0.001]
(bracketed values: recommendation strength)
Mini-Break
Terrorist Alert
System Parameters
• Metadata Density: to validate the algorithm, we remove a percentage of the metadata in the system and test whether the algorithm can reconstruct it (d \in [0,1])
• Metadata Percentile: only those metadata tags in the pth percentile are accepted as valid metadata (p \in [0,1])
** Validation is based on precision and recall values
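The validation measures can be sketched as follows. The reading of "pth percentile" used here (keep a tag when the fraction of recommended tags with strength at or below its own is at least p) is an assumption, as are the function names and the held-out value. The strengths reuse the journal example from the Metadata Recommendations slide.

```python
def accept_by_percentile(strengths, p):
    """Keep tags whose percentile rank among the recommendations is >= p.

    strengths: tag -> recommendation strength. The rank definition is an
    assumed interpretation of the percentile parameter.
    """
    n = len(strengths)
    accepted = set()
    for tag, s in strengths.items():
        rank = sum(1 for other in strengths.values() if other <= s) / n
        if rank >= p:
            accepted.add(tag)
    return accepted

def precision_recall(recommended, held_out):
    """Precision and recall of accepted tags against the removed metadata."""
    recommended, held_out = set(recommended), set(held_out)
    hits = len(recommended & held_out)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

strengths = {
    "Journal of Complexity": 0.2457,
    "Journal of Information Science": 0.1,
    "Information Processing and Management": 0.001,
}
held_out = {"Journal of Complexity"}  # hypothetical removed value

accepted = accept_by_percentile(strengths, p=0.5)
precision, recall = precision_recall(accepted, held_out)
```

Raising p trades recall for precision: with p = 0.9 only the strongest recommendation survives in this example.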
Results for Co-Author Network (Citation Metadata)
Results for Co-Author Network (Organization Metadata)
Results for Co-Author Network (Keyword Metadata)
Results for Co-Keyword Network (Citation Metadata)
Results for Co-Keyword Network (Journal Metadata)
Results for Citation Network (Author Metadata)
Results for Citation Network (Keyword Metadata)
Results for Citation Network (Journal Metadata)
Take Home Points
• Different edge types are better at propagating different metadata types.
• The method can work for any resource type as long as there exists some preliminary vetted metadata and a way to create resource relations. (If there is pre-existing metadata, then resource relations can be created automatically.)
Future Work (part 1)
• What about path types? e.g. take a co-author edge, then a citation edge, etc. Better precision and recall?
• Explore usage metadata, which is applicable to any resource type and allows for cross-resource relations (e.g. manuscripts connected to audio). The weight between two resources is a function of the interval between their downloads from the same IP (Bollen et al., 2004).
Future Work (part 2)
• Application to social networks? Given an unknown individual, infer that individual's attributes from their social relationships.
How does ‘work_with’ differ from ‘married_to’? People joined by ‘work_with’ tend to share income metadata; people joined by ‘married_to’ tend to share religious belief metadata.
Conclusion
• Good life…
Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks”, [unpublished], 2005.
Know of a good journal venue?