Bibliographic metadata (including citation)

Tuesday 7th July 2009, AMG 2nd workshop, University of Leicester, Leicester
Alexey Strelnikov, Research Officer, UKOLN
Contributions from Emma Tonkin
www.bath.ac.uk

Introduction
What and why
Use cases
Key points
Issues
Recommendations

Metadata extraction is the process of describing extrinsic and intrinsic qualities of a resource

Extracting bibliographic metadata is a particular case of metadata extraction. For example: title, authors, emails, citations.

General metadata extraction tends to involve machine learning
Citation and reference analysis usually involves regular expressions
Might involve visual structure analysis and text mining
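As a minimal sketch of the regular-expression approach to reference analysis, the Python snippet below parses one reference written in a simplified author-year layout. The pattern, the function name `parse_reference` and the sample reference are made up for illustration and are not taken from any tool mentioned in this talk; real reference lists vary far more than this pattern allows.

```python
import re

# Hypothetical pattern for a simplified "Authors (Year). Title. Venue." layout.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>[^.]+)\.\s+(?P<venue>.+?)\.?$"
)


def parse_reference(text):
    """Return a dict of fields, or None when the layout is not recognised."""
    match = REFERENCE_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# Invented sample reference, used only to exercise the pattern.
sample = "Smith, J. and Jones, A. (2008). Extracting metadata from eprints. Journal of Examples."
parsed = parse_reference(sample)
print(parsed)
```

A reference that does not fit the layout returns `None`, which a real system would likely pass on to another pattern or to a person.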

To speed up long, tedious manual operations on metadata:

Generating metadata on document deposit
Revision of metadata
Comparison and aggregation

Automatic extraction can make a system more robust (in addition to existing approaches)
It is not a drop-in replacement for manual creation, but semi-automated feature extraction can make for better metadata quality overall

Dominik is a researcher, publishing his new paper
Instead of a fully manual deposit (typing in all values), he makes use of system suggestions, which make the process faster and simpler

Fiona is a researcher, assessing the impact of her paper
How many citations does my work have?
Network of citations (existing systems: Google Scholar, citeseer.net...)

Bob is a repository manager, checking for inconsistencies in the repository's metadata
Makes use of system recommendations and the confidence level of each generated value
Easier to find invalid or obsolete metadata values

Edward is an application profile/standard curator, checking inter-repository metadata
Has an application profile, but no feedback on how closely it is followed
Consistent errors:

Field not filled in
Systematically wrong value (might be related to research field or environment)

Comparison & aggregation report
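A comparison and aggregation report of this kind could start from something as simple as counting how often each required field is left empty. The sketch below assumes records are plain dictionaries; the field names, the function name `unfilled_field_report` and the sample records are hypothetical.

```python
from collections import Counter


def unfilled_field_report(records, required_fields):
    """Return, per required field, the share of records that leave it empty."""
    missing = Counter()
    for record in records:
        for field in required_fields:
            # A field counts as unfilled when absent, None, or blank.
            if not str(record.get(field) or "").strip():
                missing[field] += 1
    total = len(records) or 1  # avoid division by zero on an empty input
    return {field: missing[field] / total for field in required_fields}


# Invented records standing in for metadata harvested from repositories.
records = [
    {"title": "Paper A", "creator": "Author One", "date": ""},
    {"title": "Paper B", "creator": "", "date": ""},
    {"title": "Paper C", "creator": "Author Three", "date": "2009"},
    {"title": "Paper D", "creator": "Author Four"},
]
report = unfilled_field_report(records, ["title", "creator", "date"])
print(report)
```

A field with a high unfilled share across many repositories points to a problem with the profile rather than with individual depositors.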

All approaches have a manual analogue
Automated metadata extraction would be an improvement, but not a replacement
The service is invisible, it just makes suggestions: for example, 'the metadata field title should be Some name'

Standards: those involved in the workflow have a big impact

"The nice thing about standards is that there are so many of them to choose from" (Andrew S. Tanenbaum)

Tools: existing applications for extracting metadata

A system should take into account a number of standards for representation and format, as well as languages and locales:

Document encoding
Metadata encoding
Locale specifics
Citation formats

Document encoding

Important because it affects whether a resource can be read correctly

Document formats:

PDF, Doc, PPT, etc.

Font encoding:

UTF, locale-specific encodings
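To illustrate why encoding matters for reading a resource correctly, the sketch below tries a list of candidate encodings in turn. The candidate list, the function name `decode_with_fallback` and the sample title are assumptions for illustration; a production system would more likely rely on declared encodings or a detection library.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text readable by replacing undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacements)"


# A title stored in a locale-specific (Windows-1252) encoding.
title_bytes = "Métadonnées bibliographiques".encode("cp1252")
text, used = decode_with_fallback(title_bytes)
print(text, used)
```

Reading the same bytes as UTF-8 fails, which is exactly the situation where an extracted title ends up garbled or lost.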

Metadata encoding

This has a direct impact on the result's usability in a given context

Examples of metadata standards:

OAI-DC, SWAP, LOM, OAI-ORE, MARC
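As one example of a metadata encoding, the sketch below serialises a few extracted values as a simple OAI-DC style record using Python's standard `xml.etree.ElementTree`. The helper name `to_oai_dc` is hypothetical, the values come from this talk's title slide, and the output is a bare `oai_dc:dc` element without the schema location attributes a full OAI-PMH response would carry.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def to_oai_dc(fields):
    """Build an oai_dc:dc element from (element_name, value) pairs."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = to_oai_dc([
    ("title", "Bibliographic metadata (including citation)"),
    ("creator", "Strelnikov, Alexey"),
    ("date", "2009-07-07"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The same extracted values could be written out as SWAP, LOM or MARC instead, which is why the choice of encoding decides where the result can be used.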

Locale specifics

Country- and culture-specific formats of text elements. For example:

Right-to-left languages
Date format: dd/mm/yyyy or mm/dd/yyyy
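The date-format problem can be made concrete with a short sketch that lists every date a string could mean under the two layouts. The function name `possible_dates` and the sample strings are illustrative assumptions.

```python
from datetime import datetime


def possible_dates(text):
    """Return every calendar date the string could mean under the two layouts."""
    candidates = set()
    for layout in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            candidates.add(datetime.strptime(text, layout).date())
        except ValueError:
            pass  # the string is not a valid date under this layout
    return sorted(candidates)


print(possible_dates("07/08/2009"))  # two readings: 8 July and 7 August
print(possible_dates("25/12/2009"))  # only the day-first reading is valid
```

When two readings come back, the extractor cannot decide on its own and needs the document's locale or a human check.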

Citation and reference formats

There are many citation and reference formats, and most research fields have their own standards. For example:

APA: social sciences
MLA: literature and the arts
AMA: biology
Turabian: multi-field
Chicago: standard publications
Harvard, Numerical, MHRA: multi-field

Automated metadata extraction is a workflow that involves several interconnected software systems
Helps to overcome the heterogeneity of standards

Examples of existing tools:

DC-dot (variety of doc/web formats -> DC metadata)
DepositPlait (metadata in various formats -> metadata repository)
DataFountains (various formats -> metadata)
paperBase (prototype concentrating on eprint documents)

Full-text resource availability
Readability of the text
Legal issues
Engineering constraints: machine suggestions might be imperfect
Language and localization: the system needs retraining for each new locale

A robust system that is easy to retrain, with customizable input and output plugins

A potential gain:

Simpler (re)extraction of metadata, faster repository operations, validation

Make use of the confidence level assigned to each metadata field

A potential gain:

Identifying possibly incorrect metadata records
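One way confidence levels could help identify possibly incorrect records is to flag fields where a confident suggestion disagrees with the stored value. This is a sketch under stated assumptions: the threshold, the function name `suspicious_fields` and the confidence scores are invented for illustration.

```python
def suspicious_fields(stored, suggested, min_confidence=0.8):
    """List fields where a confident suggestion disagrees with the stored value."""
    flagged = []
    for name, (value, confidence) in suggested.items():
        current = (stored.get(name) or "").strip()
        differs = current.casefold() != value.strip().casefold()
        if confidence >= min_confidence and differs:
            flagged.append((name, current, value, confidence))
    return flagged


# Stored repository record versus hypothetical extractor output.
stored = {"title": "Bibliographic metadata", "creator": "Strelnikov, Alexey", "date": ""}
suggested = {
    "title": ("Bibliographic metadata (including citation)", 0.93),
    "creator": ("Strelnikov, Alexey", 0.88),
    "date": ("2009-07-07", 0.41),
}
flagged = suspicious_fields(stored, suggested)
print(flagged)
```

Only the title is flagged here: the creator agrees with the stored value, and the date suggestion is too uncertain to override an empty field without review.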

Make the full-text document available to the system

A potential gain:

Periodic re-examination of the resource and updating of its metadata

Investigate the problem of analysing citations

A potential gain:

Assess the level of similarity between papers
Classify the nature of a paper
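One possible way to assess similarity from citations, offered as an assumption rather than the method proposed in this talk, is the overlap between two papers' reference lists (a Jaccard index). The function name `reference_overlap` and the placeholder reference identifiers are hypothetical.

```python
def reference_overlap(refs_a, refs_b):
    """Jaccard index of two reference lists: shared / all distinct references."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0  # no references at all: treat as no measurable similarity
    return len(a & b) / len(a | b)


# Placeholder identifiers standing in for normalised references.
paper_one = ["ref-1", "ref-2", "ref-3", "ref-4"]
paper_two = ["ref-3", "ref-4", "ref-5"]
score = reference_overlap(paper_one, paper_two)
print(score)
```

The measure only works once references have been extracted and normalised to comparable identifiers, which is the harder part of the problem.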

Thank you for your attention
