Metadata Issues for e-Prints: experiences from setting up an Institutional Repository
Jessie Hey
Research Fellow, TARDis Project
University of Southampton
ePrints UK Workshop, Ashmolean Museum, Oxford
22 Mar 2004
e-Prints
A simple illustration of diversity in metadata!
• EPrints (software)
• e-Prints (Soton)
• ePrints (UK project)
• eprints (in URLs, emails)
• E-print (Network – US gateway)
Searching for e-Prints in Google
e-Prints 1,200,000; eprints 225,000
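One way a search service might cope with this diversity is to fold the spelling variants into a single key before indexing or matching. The Python sketch below is illustrative only: the function name and the choice of "eprint" as the canonical form are assumptions, not part of the EPrints software or any search service mentioned here.

```python
import re

# Hypothetical helper: matches EPrints, e-Prints, ePrints, eprints, E-print
# (and "e print" with a space), but not words such as "blueprints".
_EPRINT_VARIANT = re.compile(r"\be[\s-]?prints?\b", re.IGNORECASE)

def normalise_eprint_terms(text: str) -> str:
    """Replace every spelling variant with the single key 'eprint'."""
    return _EPRINT_VARIANT.sub("eprint", text)

variants = ["EPrints", "e-Prints", "ePrints", "eprints", "E-print"]
keys = {normalise_eprint_terms(v) for v in variants}
print(keys)  # {'eprint'}
```

With a key like this, a query for any one variant can retrieve records that use the others.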
Plam pilot?
• Looking for a PDA?
• Just try searching for plam pilot on eBay
• Even a sale is not incentive enough
Metadata
• The modern word for ‘Data about data’
• Generally structured data describing an e-Print in this context
• Describing an object such as a journal article or book chapter or thesis
Metadata issues for today
• Who needs the quality?
• What kind of quality?
• How we approached it in TARDis
– the depositor
– the process
– classification
– mediation
• Balancing demands the pragmatic way
Who needs the quality?
Service providers (i.e. search services)
• Analysis in both e-learning and e-prints communities showed concern about quality of metadata in individual databases to give good search results when combined in cross-domain search services
Barton, Jane, Currier, Sarah and Hey, Jessie M.N. (2003) Building quality assurance into metadata creation: an analysis based on the learning objects and e-Prints communities of practice. In: 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice - Metadata Research and Applications, DCMI, 39-48. http://eprints.soton.ac.uk/archive/00000020/
As I am in Oxford…
• a tribute in Elvish to JRR Tolkien from the Lord of the Rings
Gandalf on Dublin Core metadata
• ‘I cannot read the fiery letters,’ said Frodo in a quavering voice.
• ‘No’ said Gandalf ‘but I can. ……this in the Common Tongue is what is said, close enough:
• One Ring to rule them all, One Ring to find them,
• One Ring to bring them all and in the darkness bind them.’
Standards for e-Prints: Dublin Core Metadata Sets
• Define minimal metadata elements for simple resource discovery
e.g. title, creator, subject and keywords, publisher, date, rights management
• Fundamental building blocks for Open Archives Initiative compliant repositories
• Software such as GNU EPrints is OAI compliant (in DSpace may need ‘switching on’)
• Full text searching (in latest version) will give additional help to compensate for weaknesses
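As a rough illustration of what a minimal record looks like, the Python sketch below holds the elements named above in a plain dictionary and reports which ones are missing. Treating all six elements as required is an assumption made for the example (simple Dublin Core itself makes every element optional), and the function name is hypothetical. The sample values come from the Barton, Currier and Hey paper cited earlier.

```python
# Element names taken from the slide; "required" is this sketch's assumption.
CORE_ELEMENTS = ("title", "creator", "subject", "publisher", "date", "rights")

def missing_elements(record: dict) -> list:
    """Return the core elements that are absent or empty in a record."""
    return [name for name in CORE_ELEMENTS if not record.get(name)]

record = {
    "title": ("Building quality assurance into metadata creation: an analysis "
              "based on the learning objects and e-Prints communities of practice"),
    "creator": ["Barton, Jane", "Currier, Sarah", "Hey, Jessie M.N."],
    "publisher": "DCMI",
    "date": "2003",
}

print(missing_elements(record))  # ['subject', 'rights']
```

A check like this could run at deposit time to prompt for gaps without blocking the submission.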
Who needs the quality?
• Academics (the depositors) need reasonable quality for their publication record whether full text is available or not
– Tendency to think a good citation matters less if access leads straight to the full text
An institutional repository needs
• To represent their own work well
• To represent their faculty and university well
• For publicity and communication
• For research assessment and proposals
• For promotion
What kind of quality?
• Fit for purpose – visibility and citability
• Rolls Royce or Volkswagen Golf or a Skoda?
• The Rolls Royce may not produce a sustainable repository
• Library of Congress had to think again with a backlog of millions
• A departmental archive had to scrap its editors (too slow)
• Need a model with a light touch
Examples to correct
From an academic’s current departmental publication record:
• Co-author given as Fadden on older references
• Given as McFadden on newer ones
• McFadden would not find all his papers!
Examples to correct
• Authors are not perfect but neither are information specialists or other sources
Recent examples:
• Author’s assistant put a conference in year 2400
• ‘Web of Knowledge’ put a conference in 2010
NB Amazon proved useful for checking book information from the title page (new Amazon ‘search inside’ service) but main entries may be less accurate
Quality Assurance Procedures
• Would like to pick up these and obvious examples of metadata in the wrong field, e.g. a book title used as the title of a chapter
• Options include regular checking (e.g. at or close to time of deposit or for annual reporting) or random checking
• Visualisation techniques promising but still expensive
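Errors such as a conference dated 2400 or 2010 are cheap to catch automatically. The Python sketch below shows one possible regular check. The record layout, the function name and the plausibility window (nothing before 1900, nothing more than one year ahead to allow for "in press" items) are all assumptions for illustration, not TARDis procedure.

```python
def flag_suspect_years(records, current_year, earliest_year=1900):
    """Return (record id, year) pairs whose year is outside a plausible range."""
    suspects = []
    for rec in records:
        year = rec["year"]
        # Allow one year ahead for items accepted but not yet published.
        if year < earliest_year or year > current_year + 1:
            suspects.append((rec["id"], year))
    return suspects

# Made-up records echoing the two errors described on the previous slide.
records = [
    {"id": "paper-a", "year": 2003},
    {"id": "paper-b", "year": 2400},
    {"id": "paper-c", "year": 2010},
]

print(flag_suspect_years(records, current_year=2004))
# [('paper-b', 2400), ('paper-c', 2010)]
```

The output is a list for a person to review rather than an automatic correction, which fits a light-touch model.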
How we approached it in TARDis
• Looked at process from point of view of depositor
– to decrease the barriers to deposit
– to improve quality by design or example
• Looked at metadata required for a good citation
– academics using e-print records for many purposes, not just visibility
• Some information may be easier to strip out if required but harder to add later, e.g.
– first name or initials – although cultural variations too
– journal title or abbreviation
Simple things deter
• Questions you can’t answer
• No place to put it
• Errors which force you to enter it again
• On a credit card payment
– Date on the card: 06/05
– Date to enter: 06/2005
How many times do I do this incorrectly!
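The credit card example suggests a general design principle: accept the value in the form the user sees it, and convert internally. The Python sketch below is a minimal illustration of that idea; the function name and the assumption that a two-digit year belongs to the 2000s are hypothetical choices for the example.

```python
import re

# Accept "MM/YY" as printed on the card as well as "MM/YYYY".
_EXPIRY = re.compile(r"^\s*(\d{1,2})\s*/\s*(\d{2}|\d{4})\s*$")

def parse_expiry(text: str, century: int = 2000):
    """Return (month, four-digit year) from either date format."""
    match = _EXPIRY.match(text)
    if not match:
        raise ValueError(f"unrecognised date: {text!r}")
    month = int(match.group(1))
    year = int(match.group(2))
    if len(match.group(2)) == 2:
        year += century  # assumption: two-digit years mean 20YY
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return month, year

print(parse_expiry("06/05"))    # (6, 2005)
print(parse_expiry("06/2005"))  # (6, 2005)
```

The same approach applies to deposit forms: tolerate the formats depositors naturally copy from the full text rather than forcing re-entry.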
To help the depositor
• Aimed to enter information as the depositor sees it on the full text
• Arranged input in the order the information is seen
• With relevant information grouped together
• With ‘pages’ of daunting size
• Fields of a size to view as much of the text as possible
TARDis - Aiding deposit – relevant fields – relevant help
The Process
• Added help where examples are useful
• Added extra buttons at top to ease navigation
• Made fields mandatory where essential
• Tension between full details and deterrent
– commentary field currently not included although some might find it useful
Some ‘quality’ traditions may be less practical
• Search service recommendations: capitals only for first word of title except proper nouns
• Process is generally ‘cut and paste’ so result is variable and advice ignored
• Get Caps, non-caps, rarely ALL CAPS
• Found in practice likely to be too time consuming to insist
• Think retrieval first rather than consistency
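"Retrieval first" can be supported by comparing titles on a normalised key while storing them exactly as pasted. The Python sketch below shows the idea; the function name is hypothetical and the sample titles are variations on this talk's own title.

```python
def title_key(title: str) -> str:
    """Case-insensitive, whitespace-insensitive key used only for matching."""
    return " ".join(title.casefold().split())

pasted_titles = [
    "Metadata Issues For E-Prints",
    "METADATA ISSUES FOR E-PRINTS",
    "Metadata  issues for e-Prints",
]

keys = {title_key(t) for t in pasted_titles}
print(keys)  # {'metadata issues for e-prints'}
```

The depositor's capitalisation is left untouched in the record; only the search index sees the folded form.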
Classification – a specific area of debate
• ePrints UK exploring automatic classification with Dewey
• TARDis looked at current practice:
– Reviewed subject classification in discipline-based and early institutional archives
– Found a whole variety of choices and levels of complexity
TARDis on subject classification
• Discussion of issues and snapshot chart http://tardis.eprints.org
• Using basic Library of Congress with view to harvesting, e.g. papers in Oceanography
• Added search box to find subject
• Departments could use an additional scheme if they wish (software option)
• Keywords can be added (cut and paste) if available (sometimes papers also have classification categories added for a journal)
• Computer classification generally expensive and requires learning examples but accuracy is improving
Towards the future – subject classification – on the fly
Mediation
• TARDis is experimenting with deposit choices
• Branch to:
– Self archiving (author or local assistant) with light review as pass through submission buffer
– Assisted archiving – give us the file with essential details not evident from the full text
Mediation in practice
• Current experience:
– Assisted archiving often time consuming – meeting the difficult ones – but can add value (e.g. fuller publisher location details such as DOI)
– Self archiving less accurate but author may know details which may be missing from full text
– Balance likely to change as authors become either more familiar with early deposit or perhaps happy to delegate to save time
– Learning curve for us – later may devolve some quality responsibility (use editorial options)
– Give additional feedback into software
The challenge of cutting and pasting from PDFs
• Sometimes rather like the Hyperbookworms (Jasper Fforde, The Eyre Affair)
• Who produce spurious capitals, apostrophes, hyphens
• Problems with hyphens, accents and words starting with f!
• LaTeX usually the culprit so Humanities have an advantage here
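Some of this damage can be repaired mechanically before the text reaches the metadata form. The Python sketch below is one possible clean-up step, not a TARDis tool. It assumes that the trouble with words starting with f comes from typographic ligatures such as "ﬁ", and that accents arrive as separate combining characters; both are guesses about the cause. The function name is hypothetical.

```python
import re
import unicodedata

def clean_pasted_text(text: str) -> str:
    """Tidy text pasted from a PDF (a sketch under the assumptions above)."""
    # NFKC expands ligatures (U+FB01 -> "fi") and recomposes letter + accent.
    text = unicodedata.normalize("NFKC", text)
    # Remove soft hyphens, which are invisible but break searching.
    text = text.replace("\u00ad", "")
    # Rejoin words split by a hyphen at a line break. This also joins genuine
    # hyphenated compounds that happen to break there, so review the result.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse remaining line breaks and runs of spaces.
    return " ".join(text.split())

sample = "\ufb01eld meta-\ndata  cafe\u0301"
print(clean_pasted_text(sample))  # field metadata café
```

Spurious capitals and apostrophes are harder to detect automatically and still need a human eye.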
Balancing demands the pragmatic way
• Author deposit changes the equation
• Incentives can increase accuracy
– Deposit support
– Requests by department or university or funding council for up to date records
• Collaboration between author, department and information specialist may be best way forward
• Aim: light quality control to achieve visibility and citability
The New World of e-Prints
• Not so elegant to work in as an Oxford College Library such as Brasenose
• But should be just as satisfying to use as it meets new needs
Thank you
For further information:
TARDis http://tardis.eprints.org/
e-Prints Soton (Research Soton) http://eprints.soton.ac.uk/
FAIR Focus on Access to Institutional Resources Programme
"Improving the Quality of Metadata in Eprint Archives" Marieke Guy and Andy Powell Ariadne Issue 38 30-January-2004
Barton, Jane, Currier, Sarah and Hey, Jessie M.N. (2003) Building quality assurance into metadata creation: an analysis based on the learning objects and e-Prints communities of practice. In: 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice - Metadata Research and Applications, DCMI, 39-48. http://eprints.soton.ac.uk/archive/00000020/