The automatic encoding of lexical knowledge in RDF topicmaps
Carol Jean Godby
OCLC Online Computer Library Center
March 6, 2001
Topicmaps of Web resources
• For navigating complex Web sites
• For managing bookmark files
• For creating views of the Web that are organized by subject
Terminology identification
• ...is an essential first step in the analysis of a document's content.
• ...is one of the most mature research subjects in natural language processing.
Lexical phrases
• Are the names of persistent concepts.
• Act like words.
• Are commonly used to name new concepts in rapidly evolving technical subject domains.
Not a lexical phrase: “Recurrent problem”
A lexical phrase: “Recurrent erosion”
Identifying lexical phrases
Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor….
Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American
Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American
Topic filter: planetary scientists, Milky Way
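The first stages of this example can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the index filter decides what to keep, so the stopword list and the rule "drop any ngram that starts or ends with a stopword" are assumptions made for this sketch, and the topic filter is not modeled at all.

```python
import re

# Hypothetical stopword list for the illustration; a real index filter
# would need a much larger list and probably part-of-speech information.
STOPWORDS = {"the", "a", "an", "as", "up", "with", "than", "think", "came",
             "about", "beneath", "of", "in", "and"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n_min=2, n_max=3):
    """Return every contiguous token sequence of length n_min..n_max."""
    return [tuple(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def index_filter(candidates):
    """Keep candidates that neither start nor end with a stopword."""
    return [c for c in candidates
            if c[0] not in STOPWORDS and c[-1] not in STOPWORDS]

text = "Planetary scientists think the convex shape came about as lava welled up"
phrases = index_filter(ngrams(tokenize(text)))
```

With this input, `phrases` keeps "planetary scientists" and "convex shape" and drops "planetary scientists think" and "welled up". It also keeps noise such as "lava welled", which is the kind of candidate a later topic filter would have to remove.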
Terminology identification: process flow
• Tokenized text: 9.8 M
• Ngrams: 1.9 M
• Index filter: 59k, 2331 phrases
• Topic filter: 35k, 1632 phrases
Strategies in the topic filter
• Word/phrase frequency and strength of association
• “Knowledge-poor” text analysis
• More sophisticated but computable text analysis
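The slides do not name the association measure used. As one common choice, pointwise mutual information (PMI) compares how often two words occur together with how often they would co-occur by chance. The sketch below is an assumption about what "strength of association" could look like in code, not the system's actual scoring.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus invented for the illustration.
tokens = ("the milky way is a galaxy and the milky way is large "
          "and the easiest way is the short way").split()
scores = pmi_scores(tokens)
```

In this toy corpus "milky way" scores higher than "is the". PMI is known to favor rare pairs, which is one reason to combine it with raw frequency rather than use it alone.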
Word and phrase frequencies
• Word/phrase frequency
high: dublin core, metadata, element, electronic resources
low: availability period, background, applicable terminologies
• Weighted frequency
1. core element, date element, metadata element
2. author name, entity name, corporate name
3. HTML tag, end tag, meta tag
Knowledge-poor techniques 1:
• Some noun phrase heads usually appear in text only with adjective or noun modifiers.
Example: holes--black holes, grey holes, central holes
• Others usually appear without modifiers.
Example: galaxy--a galaxy, our galaxy, this galaxy (unmodified); cartwheel galaxy, spiral galaxy (modified)
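A rough way to measure this contrast is to count, for each head noun, how often the word just before it is a real modifier rather than a determiner or other function word. The sketch below assumes a small hand-made list of non-modifiers in place of part-of-speech tagging; the list and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical closed-class list: a preceding word from this set counts as
# "no modifier"; any other preceding word is treated as a modifier.
NON_MODIFIERS = {"a", "an", "the", "our", "this", "that", "these", "those",
                 "of", "in", "and"}

def modified_ratio(tokens, heads):
    """For each head noun, return the share of its occurrences that are
    immediately preceded by a word outside NON_MODIFIERS.
    A head in the very first position has no preceding word and is skipped."""
    modified = defaultdict(int)
    total = defaultdict(int)
    for prev, word in zip(tokens, tokens[1:]):
        if word in heads:
            total[word] += 1
            if prev not in NON_MODIFIERS:
                modified[word] += 1
    return {h: modified[h] / total[h] for h in heads if total[h]}

# Toy token stream built from the slide's examples.
tokens = ("black holes and grey holes and central holes "
          "a galaxy our galaxy this galaxy spiral galaxy").split()
ratios = modified_ratio(tokens, {"holes", "galaxy"})
```

Here `ratios["holes"]` is 1.0 and `ratios["galaxy"]` is 0.25, so "holes" behaves like a head that needs a modifier while "galaxy" can stand alone as a topical term.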
Consequences
• We can identify topical single terms:
galaxy, star, sun, moon
government, abortion, communism
metadata, html, Internet, information
• We can create subject taxonomies:
galaxy (-ies): cartwheel galaxy, elliptical galaxy, spiral galaxy
*hole(s): black holes, drill holes, grey holes
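Grouping phrases under their head noun is enough to produce a small taxonomy like the one above. The sketch below assumes that the head is the last word of the phrase and uses a deliberately crude plural stripper; both are simplifications for illustration.

```python
from collections import defaultdict

def singular(word):
    """Very rough plural stripping, enough for the slide's examples."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def taxonomy_by_head(phrases):
    """Group multiword phrases under the singular form of their last word."""
    groups = defaultdict(list)
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) > 1:
            groups[singular(words[-1])].append(phrase)
    return dict(groups)

phrases = ["cartwheel galaxy", "elliptical galaxy", "spiral galaxy",
           "black holes", "drill holes", "grey holes"]
taxonomy = taxonomy_by_head(phrases)
```

The result maps "galaxy" to the three galaxy phrases and "hole" to the three hole phrases, each of which can then be recorded as a narrower term of its head.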
Knowledge-poor techniques 2: subject probes
• Goal: to get high-quality subject terms
• Look for markers of a subject that is talked about, written about or studied: topics in, study of, analysis of, (on the) subject of, major in…
• Probes differ in specificity.
topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine
analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
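A subject probe can be approximated with a regular expression that finds the probe phrase and captures what follows it. The probe phrases below come from the slide; the rule that the subject term ends at the next punctuation mark is an assumption for this sketch, and a real system would more likely use noun phrase boundaries.

```python
import re

# Probe phrases copied from the slide.
PROBES = ["topics in", "study of", "analysis of", "subject of", "major in"]

# Assumption: the subject term runs from the probe to the next punctuation mark.
PROBE_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in PROBES) + r")\s+"
    r"([a-z][a-z\s]*?)\s*(?=[,.;:]|$)",
    re.IGNORECASE,
)

def probe_terms(text):
    """Return (probe, candidate subject term) pairs found in the text."""
    return [(m.group(1).lower(), " ".join(m.group(2).lower().split()))
            for m in PROBE_RE.finditer(text)]

# Invented sentence that reuses terms from the slide.
text = ("She teaches topics in number theory. He chose a major in library "
        "science, and wrote an analysis of income dynamics among men.")
terms = probe_terms(text)
```

On this sentence the sketch pairs "topics in" with "number theory", "major in" with "library science", and "analysis of" with "income dynamics among men", which mirrors the difference in specificity between the probes.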
Some results
The identification of term relationships
Singular/Plural
library--libraries
Acronyms
Standard Generalized Markup Language--SGML
Library of Congress Subject Headings--LCSH
Coordination
library and information science--library science, information science
information storage and retrieval--information storage, information retrieval
cataloging and interlibrary loan--cataloging, interlibrary loan
Ellipsis
abbreviated key title--abbreviated title
authority file records--authority records
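Three of these relationships can be tested with simple string operations. The sketch below is a guess at workable heuristics, not the system's method. In particular, the three coordination examples share different parts (the head, the first modifier, or nothing), so the sketch tries each reading and keeps the first one whose parts all appear in a set of already known terms. That `known` set is a hypothetical input.

```python
def is_acronym(short, phrase):
    """True if `short` equals the initials of the capitalized words."""
    initials = "".join(w[0] for w in phrase.split() if w[0].isupper())
    return short == initials

def is_ellipsis(short, full):
    """True if the words of `short` appear in order within `full`
    and at least one word of `full` has been left out."""
    short_words, full_words = short.split(), full.split()
    remaining = iter(full_words)
    return (len(short_words) < len(full_words)
            and all(w in remaining for w in short_words))

def expand_coordination(phrase, known):
    """Split "X and Y" into two terms, choosing the first reading whose
    parts are all attested in `known`; return [] if none is."""
    if " and " not in phrase:
        return []
    left, _, right = phrase.partition(" and ")
    l, r = left.split(), right.split()
    readings = [
        [" ".join(l + r[-1:]), right],   # shared head
        [left, " ".join(l[:1] + r)],     # shared first modifier
        [left, right],                   # nothing shared
    ]
    for parts in readings:
        if all(p in known for p in parts):
            return parts
    return []

# Hypothetical vocabulary of terms already extracted from the corpus.
KNOWN = {"library science", "information science", "information storage",
         "information retrieval", "cataloging", "interlibrary loan"}
```

For example, `expand_coordination("information storage and retrieval", KNOWN)` yields "information storage" and "information retrieval", while `is_acronym("LCSH", "Library of Congress Subject Headings")` holds because the lowercase "of" is skipped.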
A more abstract relationship: hypernym/hyponym
• “…electronic formats, such as text/HTML, ASCII, or PostScript ….”
• Other examples from our data:
Controlled Vocabularies: Medical Subject Headings, Art and Architecture Thesaurus
metadata element set: Dublin Core
protocol server applications: NFS server, FTP server, Web server
moving images: films, videos, simulations
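The "such as" construction in the first example is a lexical pattern that can be matched directly. The sketch below handles only that one pattern and assumes the broader term is the one or two words immediately before "such as". Both the pattern and that window size are simplifications chosen for illustration.

```python
import re

# Assumption: the hypernym is the last one or two words before "such as",
# and the hyponym list ends at the next period or ellipsis.
SUCH_AS = re.compile(r"\b((?:[\w-]+\s+)?[\w-]+),?\s+such as\s+([^.…]+)")

def hyponyms_from_such_as(sentence):
    """Return (hypernym, [hyponyms]) for the first "such as" pattern,
    or None if the sentence does not contain one."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return None
    hypernym = match.group(1).strip()
    items = re.split(r",\s*(?:or\s+|and\s+)?|\s+(?:or|and)\s+", match.group(2))
    return hypernym, [item.strip() for item in items if item.strip()]

# Invented carrier sentence around the slide's example.
sentence = ("Resources come in electronic formats, such as text/HTML, "
            "ASCII, or PostScript.")
result = hyponyms_from_such_as(sentence)
```

Here `result` pairs "electronic formats" with text/HTML, ASCII, and PostScript. Each pair can then be stored as a broad/narrow link.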
A graph representation of relationships
[Figure: graph of term relationships. Node labels: Dewey Decimal, Dewey, call numbers, Dewey numbers, Dewey decimal classification numbers, cutter numbers, DDC, DDC and LCSH, Library of Congress Subject Headings, Subject Headings. Edge labels: B/N (Broad/Narrow), Ellipsis, Acronym, Coordination.]
RDF Topic Representation
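[Figure: RDF graph of two topics, Numbers and Dewey numbers, linked by broad and narrow arcs. Each topic has a name literal (“numbers”, “Dewey Numbers”) and an isDefinedIn arc. Resource labels shown: http://r1, http://r2, http://r3.]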
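The property names on this slide (name, isDefinedIn, broad, narrow) are enough to write the two topics out as RDF triples. The sketch below serializes them by hand in N-Triples syntax rather than through an RDF library. The namespace and all resource URIs are placeholders invented for the example, and the direction chosen for the broad and narrow arcs is an assumption.

```python
# Hypothetical namespace for the property names shown on the slide.
NS = "http://example.org/topicmap#"

class Literal(str):
    """Marks an object value as a plain string literal rather than a URI."""

def topic_triples(uri, name, defined_in):
    """Describe one topic: its display name and the document defining it."""
    return [(uri, NS + "name", Literal(name)),
            (uri, NS + "isDefinedIn", defined_in)]

def link_broad_narrow(broad_uri, narrow_uri):
    """Assumed direction: the narrow topic points to its broad topic
    via `broad`, and the broad topic points back via `narrow`."""
    return [(narrow_uri, NS + "broad", broad_uri),
            (broad_uri, NS + "narrow", narrow_uri)]

def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, Literal):
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

# Placeholder URIs for the two topics and their source document.
numbers = "http://example.org/topic/numbers"
dewey = "http://example.org/topic/dewey-numbers"
source = "http://example.org/doc/1"

triples = (topic_triples(numbers, "numbers", source)
           + topic_triples(dewey, "Dewey Numbers", source)
           + link_broad_narrow(numbers, dewey))
output = to_ntriples(triples)
```

The resulting text has six lines, one per arc in the graph, and could be loaded by any tool that reads N-Triples.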
System flow 1: processing steps
1. Harvest Web text.
2. Extract terminology and relationships.
3. Organize terminology into an RDF graph.
4. Import the RDF graph into the Extended Open RDF Toolkit.
System flow 2: User interaction
[Figure: diagram connecting the User, the RDF search engine, the RDF concept graph, and the Web.]
A screen shot
Future plans
• Develop a user interface that fully exploits the richness of the RDF graph structure.
• Merge terminology extracted from source documents with other sources of information.
• Improve processes for automatically extracting terminology.
References
• The Extended Open RDF Toolkit. Accessible at: http://eor.dublincore.org/
• “Automatically generated topic maps of World Wide Web resources.” Accessible at: http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm
• “The WordSmith indexing system.” Accessible at: http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm