Oct 12-14, 2003 NSDL 2003 1
Challenges in Building Federation Services over Harvested Metadata
Kurt Maly, Michael Nelson, Mohammad Zubair
Digital Library Group
Old Dominion University
Norfolk, VA 23529
Outline
• Motivation
• Overview
• Process Automation
• Web Services and Applications
• Performance
• Conclusions and Future Work
Motivation
• Harvesting provides only the basic services to get metadata from repositories.
• Processing these data or retrieving related metadata is not part of the OAI-PMH.
• Dynamic harvesting makes it challenging to keep specialized services consistent as new metadata records are ingested.
• Use of the Web Services standards is growing, so providing services that comply with them will increase the usability of our digital library.
• Using web services enables third parties to provide services that enhance our native services on top of our federated collection.
Overview
Archon is a federation of physics digital libraries. Its architecture provides services to both humans and machines:
• Basic Services (for humans)
– a search and discovery service
– a service to allow searching on equations embedded in the metadata
– a cross-archive citation service
• OAI Services (for machines)
– a storage service for the metadata of collected archives
– a harvester service to collect data from digital libraries using OAI-PMH
– a data provider service to expose metadata to OAI-PMH harvesters
• Web Services (for machines)
– a focus library for personal use
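As a rough illustration of what the harvester service asks of a repository, the sketch below builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix`, `from`, and `set` arguments come from the OAI-PMH protocol; the base URL and the function name are hypothetical, and this is not Archon's harvester code.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL.

    from_date limits the harvest to records changed on or after that
    date, which is what makes a daily (incremental) harvest possible.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository base URL, used only for illustration.
url = build_list_records_url("http://archive.example.org/oai", from_date="2003-10-01")
print(url)
```

A history harvest would omit `from_date`, while a daily harvest would pass the date of the previous run.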
Archon Architecture
[Figure: Archon architecture. Components shown: User Interface, Search Engine (Servlet), JDBC, Data Normalization, History Harvest, Daily Harvest, Harvester, Data Provider, Data Provider Cache, Relational Database (Oracle), Extended Services. Actors: search users, publishing users.]
Process Automation
• At the core of Archon we have high-level services that require post-processing of harvested metadata.
• We implemented Archon's post-harvesting processes as tasks that can be run incrementally and automatically.
• The Archon post-processing consists of tasks for citation and equation processing, normalization, and a subject resolver.
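One way to picture "tasks that can be run incrementally" is a small pipeline that applies each task only to records newer than the last processed datestamp. The sketch below is an assumption about the shape of such a pipeline, not Archon's implementation; the `Record` class, the task functions, and the high-water-mark scheme are all hypothetical, and three of the four tasks are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Record:
    identifier: str
    datestamp: str  # ISO date, so string comparison orders correctly
    fields: Dict[str, str] = field(default_factory=dict)
    done: List[str] = field(default_factory=list)  # names of tasks applied


def normalize(rec: Record) -> None:
    # Collapse runs of whitespace in the title (stand-in for normalization).
    rec.fields["title"] = " ".join(rec.fields.get("title", "").split())
    rec.done.append("normalization")


def placeholder(name: str) -> Callable[[Record], None]:
    # Placeholder task: only records that the step ran.
    def task(rec: Record) -> None:
        rec.done.append(name)
    return task


TASKS = [normalize, placeholder("citation"), placeholder("equation"), placeholder("subject")]


def run_incremental(records: Iterable[Record], tasks, last_run: str) -> str:
    """Run every task on records newer than last_run; return the new mark."""
    newest = last_run
    for rec in records:
        if rec.datestamp <= last_run:
            continue  # already processed in an earlier run
        for task in tasks:
            task(rec)
        newest = max(newest, rec.datestamp)
    return newest


records = [
    Record("oai:example:1", "2003-09-30", {"title": " Old  record "}),
    Record("oai:example:2", "2003-10-02", {"title": "  New   record "}),
]
mark = run_incremental(records, TASKS, last_run="2003-10-01")
```

Storing the returned mark between runs lets a scheduler call `run_incremental` after each harvest without reprocessing older records.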
Harvest Post Processing - Citation Processing
• The reference-linking service provides the user with a list of the references for each metadata record.
• Where possible, the service provides links to the documents at external source archives and within Archon.
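To make the idea of resolving a raw reference to a record concrete, here is a minimal sketch that parses a citation string into a (journal, volume, page) key and looks it up in an index of known records. The key format, the regular expression, and the index contents are assumptions made for illustration; the slides do not describe how Archon's resolver matches references.

```python
import re
from typing import Dict, Optional, Tuple

ReferenceKey = Tuple[str, str, str]  # (journal, volume, first page)

# Matches e.g. "Phys. Rev. D 58, 123456": journal name, volume, comma, page.
REFERENCE_PATTERN = re.compile(
    r"(?P<journal>[A-Za-z][A-Za-z. ]*?)\s+(?P<volume>\d+),\s*(?P<page>\d+)"
)


def parse_reference(raw: str) -> Optional[ReferenceKey]:
    """Extract a normalized lookup key from a raw reference string."""
    match = REFERENCE_PATTERN.search(raw)
    if match is None:
        return None
    # Drop dots and spaces so "Phys. Rev. D" and "Phys Rev D" agree.
    journal = re.sub(r"[.\s]+", "", match.group("journal")).lower()
    return (journal, match.group("volume"), match.group("page"))


def resolve_reference(raw: str, index: Dict[ReferenceKey, str]) -> Optional[str]:
    """Return the identifier of the cited record, or None if unresolved."""
    key = parse_reference(raw)
    return index.get(key) if key is not None else None


# Hypothetical index entry mapping a citation key to a record identifier.
index = {("physrevd", "58", "123456"): "oai:example.org:record-1"}
resolved = resolve_reference("S. Author, Phys. Rev. D 58, 123456 (1998)", index)
unresolved = resolve_reference("private communication", index)
```

A reference that resolves to an identifier inside the federation would become an internal link; one that only matches a known external archive would become an external link.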
[Figure: citation processing workflow. Components shown: OAI and other harvesters, parser (extract references), raw references, reference collector, reference resolver, reference process, get archive, raw bibliographic data, normalization, bibliographic collector, DC metadata, crosslinks, external links, old link adjustment.]
Harvest Post Processing - Citation Processing
Data for Resolved References
Archive  Total      External   Internal   Linked     Resolved
arXiv    4,838,158  2,191,419  1,257,367  2,790,904  2,900,347
APS        686,521    427,601    195,187    432,604    520,843
CERN        58,105     24,345      9,115     25,513     27,753
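The share of references resolved per archive can be read off the Total and Resolved columns. The short calculation below uses only those two columns; it assumes "Resolved" counts references out of "Total", which the table does not state explicitly.

```python
# (total references, resolved references) per archive, from the table above.
RESOLVED_REFERENCES = {
    "arXiv": (4_838_158, 2_900_347),
    "APS": (686_521, 520_843),
    "CERN": (58_105, 27_753),
}


def resolved_share(total: int, resolved: int) -> float:
    """Percentage of references that were resolved."""
    return 100.0 * resolved / total


shares = {
    archive: round(resolved_share(total, resolved), 1)
    for archive, (total, resolved) in RESOLVED_REFERENCES.items()
}
total_references = sum(total for total, _ in RESOLVED_REFERENCES.values())

for archive, share in shares.items():
    print(f"{archive}: {share}% resolved")
print(f"All archives: {total_references:,} references")
```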
Harvest Post Processing - Equation Processing
• We represent the equations as images and display these images when the metadata records are displayed. This requires the following tasks to be performed after harvesting new metadata records:
– Identifying equations
– Filtering equations
– Equation storage
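The three tasks can be sketched as three small functions. This example assumes equations appear in the metadata as inline LaTeX delimited by `$...$`, and it filters out very short fragments and duplicates; both choices, and all function names, are assumptions for illustration. Rendering to an image is left out.

```python
import re
from typing import Dict, List

# Assumption: equations are inline LaTeX delimited by single dollar signs.
EQUATION_PATTERN = re.compile(r"\$([^$]+)\$")


def identify_equations(text: str) -> List[str]:
    """Return every $...$ fragment found in a metadata field."""
    return [fragment.strip() for fragment in EQUATION_PATTERN.findall(text)]


def filter_equations(candidates: List[str], min_length: int = 3) -> List[str]:
    """Drop trivial fragments (e.g. a lone symbol) and duplicates."""
    seen = set()
    kept = []
    for equation in candidates:
        if len(equation) < min_length or equation in seen:
            continue
        seen.add(equation)
        kept.append(equation)
    return kept


def store_equations(record_id: str, equations: List[str], store: Dict[str, List[str]]) -> None:
    """Attach the kept equations to the record they came from."""
    store.setdefault(record_id, []).extend(equations)


abstract = "We study $E = mc^2$ for mass $m$ and revisit $E = mc^2$ later."
store: Dict[str, List[str]] = {}
candidates = identify_equations(abstract)
store_equations("oai:example:1", filter_equations(candidates), store)
```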
[Figure: equation processing components. Names shown: EqnExtractor, EqnCleaner, EqnFilter, EqnRecorder, Eqn2Gif, Img2Gif, cHotEqn, MathEqn, Acme.JPM.Encoders.GifEncoder, Image Converter, Formula Filter. Data shown: DC Metadata, Eqn Data.]
Harvest Post Processing - Subject Resolvers
• Our subject resolver tries to fill the subject field for APS and arXiv DC records.
[Figure: subject resolver workflow. Steps shown: get parallel metadata; parse to get PACS code; map code to subject string using the PACS spec; write subject to DC; guess subject.]
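The parse-and-map steps can be sketched as follows. The code pattern follows the usual PACS layout (two digits, two digits, two characters, as in `04.70.Dy`), and the lookup table is a two-entry illustrative excerpt keyed on the top-level category, which should be checked against the PACS specification. The function names are hypothetical, and the "guess subject" fallback is not implemented here.

```python
import re
from typing import List

# PACS codes look like "04.70.Dy" or "03.65.-w".
PACS_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.[A-Za-z+\-][A-Za-z]")

# Illustrative excerpt only: top-level PACS category -> subject string.
PACS_TOP_LEVEL = {
    "03": "Quantum mechanics, field theories, and special relativity",
    "04": "General relativity and gravitation",
}


def extract_pacs_codes(parallel_metadata: str) -> List[str]:
    """Find all PACS codes in the text of a parallel metadata record."""
    return PACS_PATTERN.findall(parallel_metadata)


def resolve_subjects(parallel_metadata: str) -> List[str]:
    """Map each PACS code to a subject string; skip unknown categories."""
    subjects = []
    for code in extract_pacs_codes(parallel_metadata):
        subject = PACS_TOP_LEVEL.get(code[:2])
        if subject is not None and subject not in subjects:
            subjects.append(subject)
    return subjects


codes = extract_pacs_codes("PACS numbers: 04.70.Dy, 03.65.-w")
subjects = resolve_subjects("PACS numbers: 04.70.Dy, 03.65.-w")
```

An empty result would be the point at which a resolver falls back to guessing the subject or leaves the field unresolved.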
Harvest Post Processing - Statistics
Historical harvest
Archive  #records  #refs
APS        39,064    686,521
arXiv     229,076  4,838,158
CERN       17,055     58,105
NASA       38,688        N/A
Emilio      3,480        N/A
Incremental harvest (archives: APS, arXiv, CERN, NASA; columns: #records, #refs, #Equation, #subject resolved)
4,05249607
66,096 0*594 12
37 581
25 48
*Due to lack of parallel metadata or parse errors in parallel metadata. Equations are not processed for records whose subject is not resolved.
Archon collection
Unique Authors: 346,315
Unique Subjects: 9,889
Equations (all): 330,503
Web Services and Applications
• We created web services that allow students and teachers to create personal collections.
• These services follow Web Services standards, including the use of SOAP requests and responses for communication between the clients and the services.
• Examples of these services include:
– Search Service
– Book Shelf Service
• Book Shelf Service
– allows each user to have a personalized collection that is a subset of the federation
– enables teachers to collect course materials and package them in a personalized collection
– enables students who are researching a topic to make a special collection that contains all the related documents
• Search Service
– provides access to all search functionality without the need to use the Archon interface
– allows each user (e.g. a teacher) to provide a customized client for the collections, with special features according to a course's needs
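For readers unfamiliar with SOAP, the sketch below shows what a search request from a custom client might look like on the wire. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `search` operation, and its `query` and `maxResults` parameters are hypothetical, since the slides do not give the actual interface of the Archon Search Service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:archon-search"  # hypothetical service namespace


def build_search_request(query: str, max_results: int = 10) -> bytes:
    """Serialize a SOAP envelope carrying one hypothetical search call."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}search")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}query").text = query
    ET.SubElement(operation, f"{{{SERVICE_NS}}}maxResults").text = str(max_results)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


request = build_search_request("black hole entropy", max_results=5)
print(request.decode("utf-8"))
```

A client would POST these bytes to the service endpoint with a `text/xml` content type and parse the SOAP response body in the same way.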
Conclusions and Future Work
• Our collections hold about 300K DC metadata records for documents from APS, CERN, arXiv, Emilio, and NASA.
• We also collected 30K parallel metadata records from APS.
• We have also resolved the data for 5.5M references cited by the above documents.
• Our performance analysis shows that we can comfortably schedule the OAI harvester to run about once a day and still have a safety margin for human intervention should the automatic process break down.
• We have developed Web Services that can be used for search and discovery of our collections.
• Other developers can use these web services to provide customized or enhanced services, or to build new services on top of those currently provided.
• We have also developed sample client applications, such as a bookshelf client that stores a collection of documents and exports them as references (in user-defined formats) to help authors write research papers.
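A bookshelf with export in user-defined formats can be pictured as a small collection class whose export step fills a caller-supplied template. This is a sketch under assumed names and fields (`Document`, `BookShelf`, and the template placeholders are all hypothetical), not the sample client itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Document:
    identifier: str
    authors: str
    title: str
    year: int


@dataclass
class BookShelf:
    owner: str
    documents: List[Document] = field(default_factory=list)

    def add(self, document: Document) -> None:
        """Add a document unless it is already on the shelf."""
        if document not in self.documents:
            self.documents.append(document)

    def export(self, template: str) -> List[str]:
        """Format every document with a user-defined template string."""
        return [
            template.format(
                identifier=doc.identifier,
                authors=doc.authors,
                title=doc.title,
                year=doc.year,
            )
            for doc in self.documents
        ]


shelf = BookShelf(owner="teacher-1")
paper = Document("oai:example:1", "A. Author and B. Author", "A Sample Title", 2003)
shelf.add(paper)
shelf.add(paper)  # duplicate, ignored
references = shelf.export("{authors} ({year}). {title}.")
```

Changing only the template string yields a different citation style from the same shelf.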
• We have almost finished bringing the federation of CERN, arXiv, and APS into production service. We have partially added NASA and plan to collaborate with AIP (American Institute of Physics) to include their collections as well. Once all of these are federated and operating dynamically at a high service level, the Web services should prove attractive, particularly to authors of papers, who can thus maintain their own bibliographies.
Future Work
• Collections have overlapping holdings, so a strong de-duplication service is needed
• Expand the personalization effort to allow students and researchers to integrate the DL information into their writing of reports and papers
• Test a role-based access system that allows each contributing collection to have different policies for different organizations