This talk was given at the automatic metadata extraction workshop run by Intrallect and Jisc. It covers bibliographic metadata extraction in the context of automated extraction.
1. Bibliographic metadata (including citation)
Tuesday 7th July 2009
AMG 2nd workshop, University of Leicester, Leicester
Alexey Strelnikov, Research Officer, UKOLN
Contributions from Emma Tonkin
www.bath.ac.uk
UKOLN is supported by:
2. Agenda
Introduction
What and why
Use cases
Key points
Issues
Recommendations
8. Introduction
Metadata extraction is the process of describing extrinsic and
intrinsic qualities of a resource
9. Bibliographic metadata
Bibliographic metadata extraction is a particular case of metadata extraction.
For example:
  Title
  Authors
  Emails
  Citations
15. What and why
General metadata extraction tends to involve machine learning
Citation and reference analysis usually involves regular expressions
It might also involve visual structure analysis and text mining
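To illustrate the regular-expression approach, here is a minimal sketch that splits one APA-like reference string into authors, year and title. The pattern, the `parse_reference` helper and the sample reference are all made up for illustration; they cover a single simple layout, and a real reference list would need many more patterns.

```python
import re

# Hypothetical pattern for references shaped like:
#   "Surname, A., & Other, B. (2009). Title of the work. Venue."
REFERENCE_RE = re.compile(
    r"^(?P<authors>.+?)\s*"      # everything before the year
    r"\((?P<year>\d{4})\)\.\s*"  # four-digit year in parentheses
    r"(?P<title>[^.]+)\."        # title up to the first full stop
)


def parse_reference(text):
    """Return a dict with authors, year and title, or None if no match."""
    match = REFERENCE_RE.match(text.strip())
    if match is None:
        return None
    return {
        "authors": match.group("authors").strip(),
        "year": int(match.group("year")),
        "title": match.group("title").strip(),
    }


# Invented sample reference, not taken from the talk.
example = (
    "Smith, J., & Jones, A. (2009). Extracting metadata from eprints. "
    "Example Journal, 1(2), 3-4."
)
print(parse_reference(example))
```

Returning `None` for anything that does not match keeps the failure explicit, so a calling system can fall back to manual entry or to another pattern.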
18. What and why (2)
To speed up long, tedious manual operations on metadata:
  Generating metadata on document deposit
  Revising metadata
  Comparison and aggregation
22. What and why (3)
Automatic extraction can make a system more robust (in addition to existing approaches)
It is not a drop-in replacement for manual creation, but semi-automated feature extraction can improve metadata quality overall
24. Use case (1)
Dominik is a researcher publishing his new paper
Instead of a fully manual deposit (typing in all values), he makes use of system suggestions, which make the process faster and simpler
26. Use case (2)
Fiona is a researcher assessing the impact of her paper
How many citations does my work have?
Network of citations (existing systems: Google Scholar, citeseer.net...)
29. Use case (3)
Bob is a repository manager checking for inconsistencies in the repository's metadata
He makes use of system recommendations and the confidence level of each generated value
This makes it easier to find invalid or obsolete metadata values
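Bob's check could be sketched roughly as follows: compare each stored value with the extracted one and flag the field when the extractor is confident and the two disagree. The `Suggestion` type, the `needs_review` helper, the 0.8 threshold and the sample record are assumptions made for this example, not part of any system described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    field: str         # metadata field name, e.g. "title"
    stored: str        # value currently held in the repository
    extracted: str     # value proposed by the extraction system
    confidence: float  # extractor's confidence in its value, 0.0 to 1.0


def needs_review(suggestions, threshold=0.8):
    """Return fields where a confident extracted value disagrees with the stored one."""
    return [
        s.field
        for s in suggestions
        if s.confidence >= threshold and s.extracted.strip() != s.stored.strip()
    ]


# Invented sample record for illustration only.
record = [
    Suggestion("title", "Metadata extration", "Metadata extraction", 0.95),
    Suggestion("author", "A. Strelnikov", "A. Strelnikov", 0.90),
    Suggestion("date", "2009", "2008", 0.40),
]
print(needs_review(record))
```

The threshold is the tuning knob here: a lower value surfaces more candidate errors at the cost of more false alarms for the repository manager to dismiss.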
32. Use case (4)
Edward is an application profile/standard curator checking inter-repository metadata
He has an application profile, but no feedback on how well it is followed
Consistent errors:
  Field not filled
  Systematically wrong value (might be related to research field or environment)
Comparison & aggregation report
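A very small version of such an aggregation report might count, for each field in the application profile, how often it is left unfilled. The `fill_rate_report` helper, the field names and the sample records below are hypothetical and only illustrate the idea.

```python
from collections import Counter


def fill_rate_report(records, profile_fields):
    """Return, per profile field, the share of records where it is missing or blank."""
    missing = Counter()
    for record in records:
        for field in profile_fields:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing[field] += 1
    total = len(records)
    return {
        field: (missing[field] / total if total else 0.0)
        for field in profile_fields
    }


# Invented records standing in for metadata harvested from several repositories.
records = [
    {"title": "A", "abstract": "x", "date": "2009"},
    {"title": "B", "abstract": "", "date": "2009"},
    {"title": "C", "date": None},
    {"title": "D", "abstract": "y", "date": "2008"},
]
report = fill_rate_report(records, ["title", "abstract", "date"])
for field, share in report.items():
    print(f"{field}: {share:.0%} unfilled")
```

Detecting systematically wrong values would need a comparison against extracted or reference values, which this sketch does not attempt.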
36. Summary for use cases
All approaches have a manual analogue
Automated metadata extraction would be an improvement, but not a replacement
The service is invisible; it just makes suggestions, for example 'the metadata field title should be Some name'
39. Key points
Standards - those involved in the workflow have a big impact
  "The nice thing about standards is that there are so many of them to choose from" (Andrew S. Tanenbaum)
Tools - existing applications to extract metadata
40. Standards
A number of standards should be considered for representation and format, as well as for languages and locales
This is important because it may affect whether a resource is read correctly
Document formats:
  PDF, Doc, PPT, etc.
Font encoding:
  UTF, locale-specific
46. Metadata encoding
This has a direct impact on the usability of the result in a given context
Examples of metadata standards:
  OAI-DC
  SWAP
  LOM
  OAI-ORE
  MARC
52. Locale specifics
Country- and culture-specific formats of text elements
For example:
  Right-to-left languages
  Date format:
    dd/mm/yyyy
    mm/dd/yyyy
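The date-format problem can be made concrete with a short sketch that tries both readings of a slash-separated date and reports every date the string could denote. The `possible_dates` helper is a hypothetical name; an extraction system would typically use the document's locale to choose between the candidates.

```python
from datetime import datetime


def possible_dates(text):
    """Return every distinct date `text` could denote as dd/mm/yyyy or mm/dd/yyyy."""
    candidates = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            candidates.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # The string is not a valid date under this reading.
            pass
    return sorted(candidates)


for sample in ("07/07/2009", "03/04/2009", "25/12/2009"):
    dates = possible_dates(sample)
    status = "ambiguous" if len(dates) > 1 else "unambiguous"
    print(sample, [d.isoformat() for d in dates], status)
```

Only strings whose first two numbers are both 12 or less and differ from each other are truly ambiguous, which is exactly the case where locale information is needed.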
56. Citation and reference formats
Many citation/reference formats exist; most research fields have their own standards
For example:
  APA - social sciences
  MLA - literature and the arts
  AMA - biology
  Turabian - multi-field
  Chicago - standard publications
  Harvard, Numerical, MHRA - multi-field
63. Tools
Automated metadata extraction is a workflow that involves several interconnected software systems
This helps to overcome the heterogeneity of standards
65. Examples of Tools
Examples of existing tools:
  DC-dot (variety of doc/web formats -> DC metadata)
  DepositPlait (var. format metadata -> metadata repository)
  DataFountains (var. format -> metadata)
  paperBase (prototype concentrating on eprint documents)
69. Issues
Full-text resource availability
Readability of the text
Legal issues
Engineering constraints - machine suggestions might be imperfect
Language & localization - the system needs retraining for each new locale
74. Recommendations
A robust system that is easy to retrain, with customizable input & output plugins
  A potential gain: simpler (re)extraction of metadata, faster repository operations, validation
Make use of the confidence level assigned to each metadata field
  A potential gain: identifying possibly incorrect metadata records
75. Recommendations (2)
Make the full-text document available to the system
  A potential gain: periodic re-exploration of the resource and updating of the metadata
Investigate the problem of analysing citations
  A potential gain: assessing the level of similarity between papers and classifying the nature of a paper
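One simple way to assess similarity from citations, offered here only as an illustration and not as the method the talk proposes, is to measure how much two papers' reference lists overlap (a Jaccard score over shared references). The `reference_similarity` helper and the reference identifiers are hypothetical.

```python
def reference_similarity(refs_a, refs_b):
    """Jaccard similarity of two reference lists: shared references / all distinct references."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0  # no references at all, so nothing to compare
    return len(a & b) / len(a | b)


# Invented reference identifiers; in practice these would be normalised citations.
paper_a = ["ref1", "ref2", "ref3"]
paper_b = ["ref2", "ref3", "ref4"]
print(reference_similarity(paper_a, paper_b))
```

The score depends on references being normalised to a common identifier first, which is where the citation-format issues discussed earlier come into play.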