I DON’T HAVE TIME FOR METADATA! SLA 2014 Bob Kasenchak, Project Coordinator, Access Innovations [email protected] @taxobob

I Don’t Have Time for Metadata!


DESCRIPTION

Presented by Bob Kasenchak of Access Innovations, Inc. at the 2014 Special Libraries Association (SLA) annual meeting in Vancouver, British Columbia on June 7, 2014.


Page 1: I Don’t Have Time for Metadata!

I DON’T HAVE TIME FOR METADATA!

SLA 2014

Bob Kasenchak
Project Coordinator
Access Innovations
[email protected]
@taxobob

Page 2: I Don’t Have Time for Metadata!

DISCLAIMER


Page 3: I Don’t Have Time for Metadata!

OUTLINE

• Data
  • Structured Data
  • Unstructured Data

• Metadata
  • Subject Metadata
  • Entity (author, institution) Metadata
  • Document Type Metadata

• Automating Metadata
  • Heuristic/Statistical/Inferential
  • Rule-based


Page 4: I Don’t Have Time for Metadata!

CASE STUDIES


Page 5: I Don’t Have Time for Metadata!

STRUCTURED VS. UNSTRUCTURED DATA

Structured and unstructured data present different problems, and different possible solutions, for automatically adding metadata.


Page 6: I Don’t Have Time for Metadata!

STRUCTURED VS. UNSTRUCTURED DATA


Association, in view of abuses and lack of consistency in published reports, has asserted that the all-inclusive income statement, containing all income items recognized as determinants of net income, is the answer to these questions.2 The Securities and Exchange Commission has also strongly favored this solution. 3 On the 1 Committee on Accounting Procedure, American Institute of Accountants, "Income and Earned Surplus," Accounting Research Bulletin No. 32 (December, 1947). 2 (1) "A Tentative Statement of Accounting Principles Affecting Corporate Reports," THE ACCOUNTING REvIEw, June, 1936, pp. 187-191; (2) Accounting

UNSTRUCTURED

Page 7: I Don’t Have Time for Metadata!

STRUCTURED VS. UNSTRUCTURED DATA


<volume>325</volume>
<issue>5945</issue>
<fpage seq="c">1206</fpage>
<lpage>1206</lpage>
<history>
<date date-type="received"><day>26</day><month>02</month><year>2009</year></date>
<date date-type="accepted"><day>11</day><month>08</month><year>2009</year></date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2009</copyright-statement>
<copyright-year>2009</copyright-year>
<copyright-holder>Your name here</copyright-holder>
</permissions>
<abstract><p>Our extended ontogenetic growth model is a theoretical model based on conservation of energy and general biological mechanisms underlying ontogenetic growth. We do not believe that the comments of Makarieva <italic>et al</italic>. and Sousa <italic>et al</italic>. expose substantive problems with our model. Nevertheless, they raise interesting, still unresolved questions and point to philosophical differences about the role of theory and of simple, general models as opposed to complicated, specific models.</p></abstract>

STRUCTURED

Page 8: I Don’t Have Time for Metadata!

STRUCTURED VS. UNSTRUCTURED DATA

• Just extracting basic information
  • Author
  • Institution
  • Title
  • Document type
  • Accession number(s)

…can be a challenge.

However…


Page 9: I Don’t Have Time for Metadata!

STRUCTURED VS. UNSTRUCTURED DATA

• Predictability

• Positionality


[Figure labels: Journal name/Issue/Vol./etc.; Article Title; Copyright info; Author info; Abstract]

Page 10: I Don’t Have Time for Metadata!

UNSTRUCTURED DATA => STRUCTURED DATA!

<journal>Transactions on Vehicular Technology</journal>

<article-title>Relationship of Average Transmitted and Received Energies in Adaptive Transmission</article-title>

<authors><author-surname>Kotelba</author-surname><author-firstname>Adrian</author-firstname><affiliation>Member, IEEE</affiliation></authors>

<copyright-info><copyright-date>2009</copyright-date></copyright-info>

<abstract><p>This paper studies the…</p></abstract>

NOTE: Some cleanup may be required
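The idea of turning predictable, positional text into tagged fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the line order in `FIELD_ORDER`, the flat `author` tag, and the `record` wrapper element are hypothetical simplifications, not the tool or schema used in the presentation. Real layouts vary, which is why some cleanup may be required.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout rule: the leading non-blank lines of an article
# appear in a predictable order. Real layouts differ by publisher.
FIELD_ORDER = ["journal", "article-title", "author", "copyright-date"]

def lines_to_record(raw_text: str) -> ET.Element:
    """Map the leading lines of raw text to labeled fields by position."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    record = ET.Element("record")
    for tag, value in zip(FIELD_ORDER, lines):
        ET.SubElement(record, tag).text = value
    # Everything after the positional header is treated as the abstract.
    abstract = " ".join(lines[len(FIELD_ORDER):])
    ET.SubElement(record, "abstract").text = abstract
    return record

sample = """Transactions on Vehicular Technology
Relationship of Average Transmitted and Received Energies in Adaptive Transmission
Adrian Kotelba
2009
This paper studies the…
"""

record = lines_to_record(sample)
print(ET.tostring(record, encoding="unicode"))
```

A positional rule like this breaks as soon as a line is missing or wrapped, so the output would normally be checked before it is trusted.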


Page 11: I Don’t Have Time for Metadata!

STRUCTURED VS. UNSTRUCTURED DATA

• Basic information already tagged, labeled, and easy to extract
  • Author info
  • Title
  • Journal/Volume/Issue etc.

• We can add semantic (or subject) metadata
  • Targeting only those parts of the text we require
    • Title
    • Abstract
    • Full text body
    • Exclude references, etc.
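Targeting only the useful parts of a structured record might look like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The sample record and the tag names (`article`, `body`, `ref-list`) are hypothetical, chosen to resemble the XML shown earlier; an actual schema may name these elements differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; tag names are assumptions modeled on the slides.
doc = """<article>
  <article-title>Fiber optic gyroscopes in navigation</article-title>
  <abstract><p>We review <italic>recent</italic> designs.</p></abstract>
  <body><p>Full text goes here.</p></body>
  <ref-list><ref>Smith, Gyroscopes, 2001.</ref></ref-list>
</article>"""

# Only these elements are passed to the indexer; references are left out.
INDEXABLE = ("article-title", "abstract", "body")

def text_for_indexing(xml_string: str) -> str:
    root = ET.fromstring(xml_string)
    parts = []
    for tag in INDEXABLE:
        for elem in root.iter(tag):
            # itertext() flattens nested markup such as <italic>;
            # split/join collapses runs of whitespace.
            parts.append(" ".join("".join(elem.itertext()).split()))
    return "\n".join(parts)

text = text_for_indexing(doc)
print(text)
```

Leaving the reference list out keeps cited titles from generating subject terms for topics the article itself does not cover.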


Page 12: I Don’t Have Time for Metadata!

SEMANTIC METADATA

Uncontrolled
• Automatic keyword extraction
• Crowdsourced/folksonomic tags

Controlled – from a Thesaurus (or Taxonomy…)
• Inferential (Heuristic; Statistical)
• Rule-based


Page 13: I Don’t Have Time for Metadata!

SEMANTIC METADATA: HOW?

Controlled – from a Thesaurus (or Taxonomy…)
• Inferential (Heuristic; Statistical)
• Rule-based

• Manual tagging
• Automatic tagging


Page 14: I Don’t Have Time for Metadata!

SEMANTIC METADATA: MANUAL ENTRY


Page 15: I Don’t Have Time for Metadata!

SEMANTIC METADATA: MANUAL ENTRY


A Thought Experiment

• Let’s say a manual indexer can index 10 records/hour
• Let’s say the manual indexers are perfectly consistent (they’re not)
• Let’s say your manual indexers are paid $10/hour (good luck with that)

If you have 10,000 articles/pieces of content:
It would take a manual indexer 1,000 hours (25 weeks) and cost $10,000

If you have 100,000 articles:
It would take a manual indexer 10,000 hours (250 weeks, or almost 5 years) and cost $100,000

If you have 1,000,000 articles:
It would take a manual indexer 100,000 hours (~48 years) and cost $1,000,000
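The arithmetic behind the thought experiment can be reproduced with a short Python sketch. The rate and wage come from the slide; the 40-hour week and 52-week year are assumptions added here to convert hours into weeks and years.

```python
RECORDS_PER_HOUR = 10   # from the slide
HOURLY_WAGE = 10        # dollars per hour, from the slide
HOURS_PER_WEEK = 40     # assumption: one full-time indexer
WEEKS_PER_YEAR = 52     # assumption: no holidays or leave

def manual_indexing_cost(n_records: int) -> dict:
    """Estimate the effort and cost of indexing n_records by hand."""
    hours = n_records / RECORDS_PER_HOUR
    weeks = hours / HOURS_PER_WEEK
    return {
        "hours": hours,
        "weeks": weeks,
        "years": weeks / WEEKS_PER_YEAR,
        "cost": hours * HOURLY_WAGE,
    }

for n in (10_000, 100_000, 1_000_000):
    r = manual_indexing_cost(n)
    print(f"{n:>9,} records: {r['hours']:>9,.0f} h, "
          f"{r['weeks']:>7,.0f} weeks, {r['years']:>5.1f} years, "
          f"${r['cost']:,.0f}")
```

Changing the constants shows how quickly the totals grow with a more realistic wage or a slower indexing rate.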

Page 16: I Don’t Have Time for Metadata!

SEMANTIC METADATA: AUTOMATED


Page 17: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

• Disambiguate the ambiguous
• Specify most specific topics
• Improve information retrieval
  • Search
  • Browse
• Enable advanced analytics


Page 18: I Don’t Have Time for Metadata!

SEMANTIC METADATA: DISAMBIGUATION

“Mercury”


?
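One way a rule-based indexer might resolve an ambiguous word like "Mercury" is to look for supporting context words nearby. The sketch below is a deliberately small illustration: the candidate terms and cue words in `RULES` are hypothetical, and a production rule base would be far richer.

```python
from typing import Optional

# Hypothetical rules: each candidate term lists context words that support it.
RULES = {
    "Mercury (planet)": {"orbit", "planet", "solar", "nasa"},
    "Mercury (element)": {"toxicity", "thermometer", "metal", "contamination"},
    "Mercury (mythology)": {"roman", "god", "myth", "messenger"},
}

def disambiguate(text: str, trigger: str = "mercury") -> Optional[str]:
    """Pick the candidate term with the most supporting context words."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    if trigger not in words:
        return None
    scores = {term: len(cues & words) for term, cues in RULES.items()}
    best = max(scores, key=scores.get)
    # With no supporting context, no term is assigned.
    return best if scores[best] > 0 else None

print(disambiguate("Mercury contamination in river sediment raises toxicity concerns."))
print(disambiguate("The orbit of Mercury around the Sun."))
```

Returning nothing when no cue word is present is a design choice in this sketch: an unassigned record can be routed to a human, whereas a wrong tag silently degrades retrieval.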

Page 19: I Don’t Have Time for Metadata!

SEMANTIC METADATA: SPECIFICATION

Beyond exact string matches: Synonymy

Fiber optic gyroscopes      Fiber optic gyros
Fiber-optic gyroscopes      Fiber-optic gyros
Fibre optic gyroscopes      Fibre optic gyros
Fibre-optic gyroscopes      Fibre-optic gyros
Fiberoptic gyroscopes       Fiberoptic gyros
Optical fiber gyroscopes    Optical fiber gyros
Optical fibre gyroscopes    Optical fibre gyros
FOGs                        FOG’s
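Mapping every variant to a single preferred term could be sketched as follows. The variant list is taken from the slide; treating "Fiber optic gyroscopes" as the preferred term, and matching the acronyms case-sensitively, are assumptions made for this example.

```python
import re

PREFERRED = "Fiber optic gyroscopes"  # assumption: the thesaurus's preferred term

PHRASES = [
    "Fiber optic gyroscopes", "Fiber optic gyros",
    "Fiber-optic gyroscopes", "Fiber-optic gyros",
    "Fibre optic gyroscopes", "Fibre optic gyros",
    "Fibre-optic gyroscopes", "Fibre-optic gyros",
    "Fiberoptic gyroscopes", "Fiberoptic gyros",
    "Optical fiber gyroscopes", "Optical fiber gyros",
    "Optical fibre gyroscopes", "Optical fibre gyros",
]
ACRONYMS = ["FOGs", "FOG’s"]

def _alternation(variants):
    """Build a whole-word regex that matches any of the given variants."""
    return r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"

PHRASE_RE = re.compile(_alternation(PHRASES), re.IGNORECASE)
# Acronyms stay case-sensitive so that ordinary "fogs" does not match.
ACRONYM_RE = re.compile(_alternation(ACRONYMS))

def subject_terms(text: str) -> set:
    """Return the preferred term if any known variant appears in the text."""
    if PHRASE_RE.search(text) or ACRONYM_RE.search(text):
        return {PREFERRED}
    return set()

print(subject_terms("New fibre-optic gyros for aircraft"))
print(subject_terms("Morning fogs delayed the flight"))
```

Whichever spelling an author uses, the record ends up with the same tag, so a search on the preferred term retrieves all of them.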


Page 20: I Don’t Have Time for Metadata!

SEMANTIC METADATA: SPECIFICATION

Beyond exact string matches: Context. Matters.

Indexing to most specific term

- Microscopes
  - Electron microscopes
    - Scanning electron microscopes
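A naive string match on "scanning electron microscopes" also hits "Electron microscopes" and "Microscopes". A minimal sketch of indexing to the most specific term, assuming the three-level hierarchy above is stored as child-to-parent links (the `BROADER` mapping is an illustrative data structure, not a specific thesaurus format):

```python
# Child -> parent links for the hierarchy shown above.
BROADER = {
    "Electron microscopes": "Microscopes",
    "Scanning electron microscopes": "Electron microscopes",
}

def ancestors(term: str) -> set:
    """Collect every broader term above the given term."""
    out = set()
    while term in BROADER:
        term = BROADER[term]
        out.add(term)
    return out

def most_specific(matched) -> set:
    """Drop any matched term that is a broader term of another match."""
    matched = set(matched)
    implied = set().union(*(ancestors(t) for t in matched))
    return matched - implied

hits = {"Microscopes", "Electron microscopes", "Scanning electron microscopes"}
print(most_specific(hits))
```

The broader terms are not lost: they can still be reached through the hierarchy when a user browses or when a search is expanded to narrower terms.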


Page 21: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval (Search, Browse)

SEARCH ≠ BROWSE


Page 22: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Search

• Allows user to search by tags
• Ensures consistent and reliable retrieval
• Speeds electronic search
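Searching by tags can be illustrated with a small inverted index from subject term to document ids. The records below are hypothetical, and this is a sketch of the general idea rather than how any particular search platform implements it.

```python
from collections import defaultdict

# Hypothetical records, each already tagged with controlled subject terms.
RECORDS = {
    "doc-1": {"Fiber optic gyroscopes", "Navigation"},
    "doc-2": {"Scanning electron microscopes"},
    "doc-3": {"Fiber optic gyroscopes"},
}

def build_tag_index(records):
    """Invert doc -> tags into tag -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, tags in records.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index

def search_by_tags(index, *tags):
    """Return ids of documents that carry every requested tag."""
    hits = [index.get(tag, set()) for tag in tags]
    return set.intersection(*hits) if hits else set()

TAG_INDEX = build_tag_index(RECORDS)
print(sorted(search_by_tags(TAG_INDEX, "Fiber optic gyroscopes")))
print(sorted(search_by_tags(TAG_INDEX, "Fiber optic gyroscopes", "Navigation")))
```

Because lookup is by tag rather than by scanning full text, retrieval is fast and returns the same set regardless of how each document happens to phrase the topic.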


Page 23: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Search


Subject Metadata

Page 24: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Search


Metadata-based search

Results based on metadata

Page 25: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Browse


Taxonomy browse

Results based on metadata

Page 26: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Browse


Taxonomy browse

Additional search filters

Page 27: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Analytics

Combine subject metadata with metadata about
• Authors
• Institutions
• Publications (Journals, Magazines, etc.)
• Publication Types

…to create detailed informatics about your data, users, authors, and whatever else is relevant or useful
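As a rough illustration of combining subject metadata with author metadata, the sketch below counts how often each author publishes on each topic. The records and author names are placeholders, not data from the presentation.

```python
from collections import Counter, defaultdict

# Hypothetical tagged records with placeholder author names.
RECORDS = [
    {"authors": ["Author A", "Author B"], "subjects": ["Electron microscopes"]},
    {"authors": ["Author A"],
     "subjects": ["Electron microscopes", "Fiber optic gyroscopes"]},
    {"authors": ["Author C"], "subjects": ["Fiber optic gyroscopes"]},
]

def authors_by_subject(records):
    """Count, per subject term, how many records each author has."""
    counts = defaultdict(Counter)
    for rec in records:
        for subject in rec["subjects"]:
            counts[subject].update(rec["authors"])
    return counts

stats = authors_by_subject(RECORDS)
for subject, counter in stats.items():
    print(subject, counter.most_common())
```

The same grouping works for institutions, publications, or publication types by swapping the field that is counted.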


Page 28: I Don’t Have Time for Metadata!

SEMANTIC METADATA: WHY?

Improving information retrieval: Analytics


Taxonomy term

Narrower terms

Broader term(s)

Authors who publish on this topic

Page 29: I Don’t Have Time for Metadata!

I DON’T HAVE TIME FOR METADATA!


Since metadata allows you to do things you already have/want/need to do:

It’s always time for metadata.

Page 30: I Don’t Have Time for Metadata!

I DON’T HAVE

IT’S ALWAYS TIME FOR METADATA!

SLA 2014

Bob Kasenchak
Project Coordinator
Access Innovations
[email protected]
@taxobob

Thank you!