Presented by Bob Kasenchak of Access Innovations, Inc. at the 2014 Special Libraries Association (SLA) annual meeting in Vancouver, British Columbia on June 7, 2014.
I DON’T HAVE TIME FOR METADATA!
SLA 2014
Bob Kasenchak, Project Coordinator, Access Innovations
[email protected]
@taxobob
DISCLAIMER
I Don’t Have Time for Metadata!
OUTLINE
• Data
  • Structured Data
  • Unstructured Data
• Metadata
  • Subject Metadata
  • Entity (author, institution) Metadata
  • Document Type Metadata
• Automating Metadata
  • Heuristic/Statistical/Inferential
  • Rule-based
CASE STUDIES
STRUCTURED VS. UNSTRUCTURED DATA
Present different problems – and possible solutions – for automatically adding metadata
STRUCTURED VS. UNSTRUCTURED DATA
Association, in view of abuses and lack of consistency in published reports, has asserted that the all-inclusive income statement, containing all income items recognized as determinants of net income, is the answer to these questions.2 The Securities and Exchange Commission has alsostrongly favored this solution. 3 On the 1 Committee on Accounting Procedure, AmericanInstitute of Accountants, "Income and Earned Surplus," Accounting Research Bulletin No. 32 (December, 1947). 2 (1) "A Tentative Statement of Accounting Principles Affecting Corporate Reports," THE ACCOUNTING REvIEw, June, 1936, pp. 187-191; (2) Accounting
UNSTRUCTURED
STRUCTURED VS. UNSTRUCTURED DATA
<volume>325</volume>
<issue>5945</issue>
<fpage seq="c">1206</fpage>
<lpage>1206</lpage>
<history>
<date date-type="received"><day>26</day><month>02</month><year>2009</year></date>
<date date-type="accepted"><day>11</day><month>08</month><year>2009</year></date>
</history>
<permissions>
<copyright-statement>Copyright © 2009</copyright-statement>
<copyright-year>2009</copyright-year>
<copyright-holder>Your name here</copyright-holder>
</permissions>
<abstract><p>Our extended ontogenetic growth model is a theoretical model based on conservation of energy and general biological mechanisms underlying ontogenetic growth. We do not believe that the comments of Makarieva <italic>et al</italic>. and Sousa <italic>et al</italic>. expose substantive problems with our model. Nevertheless, they raise interesting, still unresolved questions and point to philosophical differences about the role of theory and of simple, general models as opposed to complicated, specific models.</p></abstract>
STRUCTURED
STRUCTURED VS. UNSTRUCTURED DATA
• Just extracting basic information
  • Author
  • Institution
  • Title
  • Document type
  • Accession number(s)
…can be a challenge.
However…
STRUCTURED VS. UNSTRUCTURED DATA
• Predictability
• Positionality
Journal name/Issue/Vol./etc.
Article Title
Copyright info
Author info
Abstract
UNSTRUCTURED DATA => STRUCTURED DATA!
<journal>Transactions on Vehicular Technology</journal>
<article-title>Relationship of Average Transmitted and Received Energies in Adaptive Transmission</article-title>
<authors><author-surname>Kotelba</author-surname><author-firstname>Adrian</author-firstname><affiliation>Member, IEEE</affiliation></authors>
<copyright-info><copyright-date>2009</copyright-date></copyright-info>
<abstract><p>This paper studies the…</p></abstract>
NOTE: Some cleanup may be required
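
The conversion above relies on predictability and positionality: if each field reliably sits in the same place, it can be wrapped in a tag. The following Python sketch illustrates the idea only and is not the presenter's tooling. It assumes a hypothetical layout in which the first five non-empty lines of a record are always journal, title, author (written "Surname, Firstname"), copyright year, and abstract. The function name `to_structured`, the wrapping `article` element, and that layout are assumptions; the other tag names follow the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: each field sits on a fixed, predictable line.
FIELD_ORDER = ["journal", "article-title", "author", "copyright-date", "abstract"]

def to_structured(raw_text: str) -> ET.Element:
    """Map positionally predictable lines onto XML elements."""
    lines = [ln.strip() for ln in raw_text.strip().splitlines() if ln.strip()]
    if len(lines) < len(FIELD_ORDER):
        raise ValueError("record does not match the expected layout")
    fields = dict(zip(FIELD_ORDER, lines))

    article = ET.Element("article")
    ET.SubElement(article, "journal").text = fields["journal"]
    ET.SubElement(article, "article-title").text = fields["article-title"]

    # The author line is assumed to read "Surname, Firstname".
    surname, _, firstname = fields["author"].partition(",")
    authors = ET.SubElement(article, "authors")
    ET.SubElement(authors, "author-surname").text = surname.strip()
    ET.SubElement(authors, "author-firstname").text = firstname.strip()

    info = ET.SubElement(article, "copyright-info")
    ET.SubElement(info, "copyright-date").text = fields["copyright-date"]

    abstract = ET.SubElement(article, "abstract")
    ET.SubElement(abstract, "p").text = fields["abstract"]
    return article

sample = """
Transactions on Vehicular Technology
Relationship of Average Transmitted and Received Energies in Adaptive Transmission
Kotelba, Adrian
2009
This paper studies the…
"""
record = to_structured(sample)
print(ET.tostring(record, encoding="unicode"))
```

Real records rarely follow a layout this strictly, which is why some cleanup is usually needed afterwards.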
STRUCTURED VS. UNSTRUCTURED DATA
• Basic information already tagged, labeled, and easy to extract
  • Author info
  • Title
  • Journal/Volume/Issue etc.
• We can add semantic (or subject) metadata
  • Targeting only those parts of the text we require
    • Title
    • Abstract
    • Full text body
    • Exclude references, etc.
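
Targeting only the required parts of a structured record could look like the sketch below. It is illustrative only: the `body` and `references` element names, the function name `text_for_indexing`, and the sample record are made up for the example.

```python
import xml.etree.ElementTree as ET

# Elements to send to the indexer; anything else (e.g. references) is skipped.
TARGET_TAGS = ("article-title", "abstract", "body")

def text_for_indexing(xml_string: str) -> str:
    """Return the text of the targeted elements only, one element per line."""
    root = ET.fromstring(xml_string)
    parts = []
    for tag in TARGET_TAGS:
        for element in root.iter(tag):
            # itertext() also collects text inside nested tags such as <p>.
            parts.append(" ".join("".join(element.itertext()).split()))
    return "\n".join(parts)

# Made-up record for illustration.
doc = """<article>
  <article-title>Scanning electron microscopes in materials science</article-title>
  <abstract><p>We survey recent uses of electron microscopes.</p></abstract>
  <body><p>Full text goes here.</p></body>
  <references><ref>Smith, J. Fiber optic gyroscopes.</ref></references>
</article>"""
print(text_for_indexing(doc))
```

Because the reference list is never read, a cited title about gyroscopes cannot trigger a subject term for an article about microscopes.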
SEMANTIC METADATA
Uncontrolled
• Automatic keyword extraction
• Crowdsourced/folksonomic tags
Controlled – from a Thesaurus (or Taxonomy…)
• Inferential (Heuristic; Statistical)
• Rule-based
SEMANTIC METADATA: HOW?
Controlled – from a Thesaurus (or Taxonomy…)
• Inferential (Heuristic; Statistical)
• Rule-based
Manual tagging
Automatic tagging
SEMANTIC METADATA: MANUAL ENTRY
A Thought Experiment
• Let’s say a manual indexer can index 10 records/hour
• Let’s say the manual indexers are perfectly consistent (they’re not)
• Let’s say your manual indexers are paid $10/hour (good luck with that)
If you have 10,000 articles/pieces of content:
It would take a manual indexer 1,000 hours (25 weeks) and cost $10,000
If you have 100,000 articles:
It would take a manual indexer 10,000 hours (250 weeks, or almost 5 years) and cost $100,000
If you have 1,000,000 articles:
It would take a manual indexer 100,000 hours (~48 years) and cost $1,000,000
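
The arithmetic behind the thought experiment can be checked with a short Python sketch. The 40-hour week is implied by "1,000 hours (25 weeks)"; the 52-week working year is an assumption added here, and the function name is illustrative.

```python
RECORDS_PER_HOUR = 10   # from the thought experiment
HOURLY_RATE = 10        # dollars, from the thought experiment
HOURS_PER_WEEK = 40     # implied by "1,000 hours (25 weeks)"
WEEKS_PER_YEAR = 52     # assumption: no holidays or vacation

def manual_indexing_cost(n_records: int) -> dict:
    """Time and cost for one indexer to tag n_records by hand."""
    hours = n_records / RECORDS_PER_HOUR
    weeks = hours / HOURS_PER_WEEK
    return {
        "hours": hours,
        "weeks": weeks,
        "years": weeks / WEEKS_PER_YEAR,
        "cost": hours * HOURLY_RATE,
    }

for n in (10_000, 100_000, 1_000_000):
    r = manual_indexing_cost(n)
    print(f"{n:>9,} records: {r['hours']:>9,.0f} hours, "
          f"{r['weeks']:>6,.0f} weeks, {r['years']:5.1f} years, ${r['cost']:,.0f}")
```

Both time and cost grow linearly with the collection, so every tenfold increase in content means ten times the indexer-years.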
SEMANTIC METADATA: AUTOMATED
SEMANTIC METADATA: WHY?
• Disambiguate the ambiguous
• Specify most specific topics
• Improve information retrieval
  • Search
  • Browse
• Enable advanced analytics
SEMANTIC METADATA: DISAMBIGUATION
“Mercury”
?
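
One way a rule-based approach might resolve "Mercury" is to look for supporting context words near the ambiguous string. The sketch below is a toy illustration: the three candidate terms and their cue words are hypothetical, not taken from any real thesaurus or rule base.

```python
import re

# Hypothetical rules: each candidate term lists context words that support it.
MERCURY_RULES = {
    "Mercury (planet)": {"orbit", "planet", "solar", "nasa", "crater"},
    "Mercury (element)": {"toxic", "thermometer", "metal", "poisoning", "hg"},
    "Mercury (mythology)": {"roman", "god", "myth", "messenger"},
}

def disambiguate_mercury(text: str) -> list[str]:
    """Return the candidate term(s) best supported by the surrounding words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if "mercury" not in words:
        return []
    scores = {term: len(words & cues) for term, cues in MERCURY_RULES.items()}
    best = max(scores.values())
    if best == 0:
        return []  # no supporting context: leave untagged rather than guess
    return [term for term, score in scores.items() if score == best]

print(disambiguate_mercury("Mercury poisoning from a broken thermometer"))
print(disambiguate_mercury("The orbit of Mercury around the Sun"))
```

A bare string match on "Mercury" would tag both sentences identically; the context rules are what separate them.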
SEMANTIC METADATA: SPECIFICATION
Beyond exact string matches: Synonymy
Fiber optic gyroscopes
Fiber optic gyros
Fiber-optic gyroscopes
Fiber-optic gyros
Fibre optic gyroscopes
Fibre optic gyros
Fibre-optic gyroscopes
Fibre-optic gyros
Fiberoptic gyroscopes
Fiberoptic gyros
Optical fiber gyroscopes
Optical fiber gyros
Optical fibre gyroscopes
Optical fibre gyros
FOGs
FOG’s
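
All of the variants above should index to a single thesaurus term. A minimal Python sketch, assuming "Fiber optic gyroscopes" is the preferred term (the slide does not say which form is preferred) and using one hand-written pattern rather than any particular product's rule syntax:

```python
import re

PREFERRED = "Fiber optic gyroscopes"  # assumed preferred term

# Spelled-out variants match case-insensitively; the acronym must be upper
# case so that ordinary prose such as "morning fogs" is not tagged.
SYNONYM_PATTERN = re.compile(
    r"\b(?:"
    r"(?i:fib(?:er|re)[ -]?optic gyro(?:scope)?s"
    r"|optical fib(?:er|re) gyro(?:scope)?s)"
    r"|FOG['’]?s"
    r")\b"
)

VARIANTS = [
    "Fiber optic gyroscopes", "Fiber optic gyros",
    "Fiber-optic gyroscopes", "Fiber-optic gyros",
    "Fibre optic gyroscopes", "Fibre optic gyros",
    "Fibre-optic gyroscopes", "Fibre-optic gyros",
    "Fiberoptic gyroscopes", "Fiberoptic gyros",
    "Optical fiber gyroscopes", "Optical fiber gyros",
    "Optical fibre gyroscopes", "Optical fibre gyros",
    "FOGs", "FOG’s",
]

def index_terms(text: str) -> set[str]:
    """Return the preferred term if any synonym appears in the text."""
    return {PREFERRED} if SYNONYM_PATTERN.search(text) else set()

for variant in VARIANTS:
    print(f"{variant!r} -> {index_terms(variant)}")
```

A search on the preferred term then retrieves documents regardless of which spelling the author used.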
SEMANTIC METADATA: SPECIFICATION
Beyond exact string matches: Context. Matters.
Indexing to most specific term
- Microscopes
  - Electron microscopes
    - Scanning electron microscopes
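
Indexing to the most specific term can be sketched as "match everything, then drop any term that is an ancestor of another match". The hierarchy below mirrors the three terms on the slide; the plain substring matching and the function names are simplifying assumptions.

```python
# Thesaurus fragment: term -> broader term (None at the top of the branch).
BROADER = {
    "microscopes": None,
    "electron microscopes": "microscopes",
    "scanning electron microscopes": "electron microscopes",
}

def ancestors(term: str) -> set[str]:
    """All broader terms above the given term."""
    found = set()
    parent = BROADER.get(term)
    while parent is not None:
        found.add(parent)
        parent = BROADER.get(parent)
    return found

def most_specific_terms(text: str) -> set[str]:
    """Keep only matched terms that are not broader than another match."""
    lowered = text.lower()
    matched = {term for term in BROADER if term in lowered}
    redundant = set().union(*(ancestors(term) for term in matched))
    return matched - redundant

print(most_specific_terms("Images taken with scanning electron microscopes"))
print(most_specific_terms("A short history of microscopes"))
```

The string "microscopes" occurs in both sentences, but only the second one is indexed with the broad term.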
SEMANTIC METADATA: WHY?
Improving information retrieval (Search, Browse)
SEARCH ≠ BROWSE
SEMANTIC METADATA: WHY?
Improving information retrieval: Search
• Allows the user to search by tags
• Ensures consistent and reliable retrieval
• Speeds electronic search
SEMANTIC METADATA: WHY?
Improving information retrieval: Search
Subject Metadata
SEMANTIC METADATA: WHY?
Improving information retrieval: Search
Metadata-based search
Results based on metadata
SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
Taxonomy browse
Results based on metadata
SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
Taxonomy browse
Additional search filters
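
Search, browse, and filtering over subject metadata can be sketched with a small in-memory index. Everything here is a toy assumption: the records, the `doc_type` field, and the choice to have browse include narrower terms are illustrative and do not describe any particular platform.

```python
from collections import defaultdict

# Toy taxonomy: term -> narrower terms.
NARROWER = {
    "Microscopes": ["Electron microscopes"],
    "Electron microscopes": ["Scanning electron microscopes"],
    "Scanning electron microscopes": [],
}

# Made-up records that already carry subject and document-type metadata.
RECORDS = [
    {"id": 1, "subjects": {"Microscopes"}, "doc_type": "Article"},
    {"id": 2, "subjects": {"Scanning electron microscopes"}, "doc_type": "Article"},
    {"id": 3, "subjects": {"Electron microscopes"}, "doc_type": "Review"},
]

def build_index(records):
    """Map each subject term to the ids of the records tagged with it."""
    index = defaultdict(set)
    for record in records:
        for subject in record["subjects"]:
            index[subject].add(record["id"])
    return index

def search(index, term):
    """Metadata-based search: records tagged with exactly this term."""
    return set(index.get(term, set()))

def browse(index, term):
    """Taxonomy browse: records tagged with the term or anything narrower."""
    ids = search(index, term)
    for child in NARROWER.get(term, []):
        ids |= browse(index, child)
    return ids

def apply_filter(records, ids, doc_type):
    """Additional search filter on another metadata field."""
    return {r["id"] for r in records if r["id"] in ids and r["doc_type"] == doc_type}

index = build_index(RECORDS)
print(search(index, "Electron microscopes"))
print(browse(index, "Microscopes"))
print(apply_filter(RECORDS, browse(index, "Electron microscopes"), "Article"))
```

Search answers "what is tagged with this term", while browse walks down the hierarchy, which is one way to see why search and browse are not the same thing.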
SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
Combine subject metadata with metadata about
• Authors
• Institutions
• Publications (Journals, Magazines, etc.)
• Publication Types
…to create detailed informatics about your data, users, authors, and whatever else is relevant or useful
SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
Taxonomy term
Narrower terms
Broader term(s)
Authors who publish on this topic
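
A view such as "authors who publish on this topic" falls out of combining subject metadata with author metadata. A minimal sketch with made-up records and placeholder author names:

```python
from collections import Counter

# Made-up records carrying both author and subject metadata.
RECORDS = [
    {"authors": ["Author A", "Author B"], "subjects": ["Electron microscopes"]},
    {"authors": ["Author A"], "subjects": ["Electron microscopes", "Fiber optic gyroscopes"]},
    {"authors": ["Author C"], "subjects": ["Fiber optic gyroscopes"]},
]

def authors_for_topic(records, topic):
    """Count how many records on the topic each author has, most prolific first."""
    counts = Counter()
    for record in records:
        if topic in record["subjects"]:
            counts.update(record["authors"])
    return counts.most_common()

print(authors_for_topic(RECORDS, "Electron microscopes"))
```

The same grouping works for institutions, publications, or publication types once those fields are tagged.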
I DON’T HAVE TIME FOR METADATA!
Since metadata allows you to do things you already have to, want to, or need to do:
It’s always time for metadata.
I DON’T HAVE
IT’S ALWAYS TIME FOR METADATA!
SLA 2014
Bob Kasenchak, Project Coordinator, Access Innovations
[email protected]
@taxobob
Thank you!