Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File

Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC Research
Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC Research

Charleston Conference, 7 November 2008



Overall Research Goals

To Build a Database that Will:

Identify

• Authoritative strings for publisher names

• Common variants for names and locations

• Hierarchical references indicating relationships and nesting of subsidiaries

• Definitions of publishing entities

Overall Research Goals

To Build a Database that Will:

Produce

• Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers

Conform

• to international authority and standards practice, and

• inter-operate with other OCLC products

Issues & Challenges

Database Quality:

Historical Practices

• “…the shortest form in which it can be understood.” [AACR2 2004]

• Different versions of cataloging rules

• Abbreviations

Errors and misspellings

Local Practices

Method: Data Mining in an “Aggregate Collection”

Data Mining and Analysis of WorldCat:

“…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.”

Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115 at 107.

WorldCat: July 2008

Total holdings: 1,292,763,300
Manifestations (records): 108,828,533
Works: 84,096,107
Digital Items: 3,182,550
Institutions: 69,000
Physical Items: ~1.2 billion

Global Origins of WorldCat Materials

US: 28%
UK: 8%
Canada: 3%
Rest of World: 27%
Unknown: 17%
France: 4%
Germany: 10%

Global Origins of WorldCat Materials

Content Languages: 478

49% of WorldCat non-English

Top 5 non-English:
German: 12 million
French: 6.1 million
Spanish: 3.5 million
Dutch: 2.6 million
Japanese: 2.4 million

Materials w/non-US origins:

57.9 million (55%)

Top 5:
Germany: 10.0 million
UK: 8.8 million
France: 4.2 million
Netherlands: 2.9 million
Canada: 2.9 million

Non-English Metadata Language:

28 million (66 languages)

Top 5:
German: 11 million
Dutch: 5.0 million
Swedish: 1.9 million
French: 1.8 million
Finnish: 0.7 million

OCLC Publisher Name Server

Publisher Name Server: Objectives

Resolve for data mining and quality of WorldCat

• ISBN prefixes to publisher name

• Variant publisher names to a preferred form

Complement Collection Analysis Service

• Librarians & Publishers
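
The two resolution objectives above can be pictured as a pair of lookups. The sketch below is only an illustration and not the project's implementation: the prefix-to-entity pairs are taken from the deck's "Top U.S. Publishing Entities by ISBN" table, while the variant spellings, the `normalize` rule, and the function names are assumptions made for the example.

```python
import re
from typing import Optional

# Prefix-to-entity pairs as listed in the deck's top-publishers table.
PREFIX_TO_ENTITY = {
    "0-13": "Prentice-Hall, Inc.",
    "0-19": "Oxford University Press",
}

# Hypothetical variant strings, keyed by their normalized form.
VARIANT_TO_PREFERRED = {
    "prentice hall": "Prentice-Hall, Inc.",
    "prentice hall inc": "Prentice-Hall, Inc.",
    "oxford univ press": "Oxford University Press",
    "oxford university press": "Oxford University Press",
}


def normalize(name: str) -> str:
    """Lowercase the string and reduce every run of punctuation or spaces to one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def resolve_name(raw: str) -> Optional[str]:
    """Map a variant publisher string to its preferred form, if known."""
    return VARIANT_TO_PREFERRED.get(normalize(raw))


def resolve_prefix(prefix: str) -> Optional[str]:
    """Map an ISBN prefix such as '0-13' to a publishing entity, if known."""
    return PREFIX_TO_ENTITY.get(prefix)
```

Unknown strings return `None` rather than a guess, which leaves them available for the hand parsing the methodology describes.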

Publisher Name Server: Objectives

Capture and profile attributes of individual publishers:

• Location(s)

• Language(s) of materials published

• Genre(s)/format(s)

• Dominant subject domain(s)

• Parent company and subsidiaries

Publisher Name Server: Methodology

Programmatically cluster publishers’ records using ISBN prefixes

• Data clustering

• Classification of similar objects into different groups

• Partitioning of a data set into subsets (clusters)

Hand parse the entities and resolve ISBN prefixes
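
As a rough sketch of the clustering step, records can be partitioned by the group-and-registrant prefix of a hyphenated ISBN-10. The ISBNs and publisher strings below are invented placeholders (check digits are not validated), and the helper names are assumptions. Unhyphenated ISBNs would need the published ISBN range tables to locate the registrant boundary, which this example does not attempt.

```python
from collections import defaultdict


def isbn_prefix(hyphenated_isbn: str) -> str:
    """Return 'group-registrant' from a hyphenated ISBN-10, e.g. '0-13'."""
    parts = hyphenated_isbn.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {hyphenated_isbn!r}")
    return f"{parts[0]}-{parts[1]}"


def cluster_by_prefix(records):
    """Partition (isbn, publisher_string) pairs into clusters keyed by ISBN prefix."""
    clusters = defaultdict(list)
    for isbn, publisher_string in records:
        clusters[isbn_prefix(isbn)].append(publisher_string)
    return dict(clusters)


# Placeholder records for illustration only; these are not real ISBNs.
records = [
    ("0-13-111111-1", "Prentice-Hall"),
    ("0-19-222222-2", "Oxford Univ. Press"),
    ("0-13-333333-3", "Prentice Hall, Inc."),
]
clusters = cluster_by_prefix(records)
```

Each cluster then holds the variant publisher strings that share a prefix, which is the material a person would review when hand parsing the entities.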

Publisher Name Server: Database

1750 publishing entities

Relational database, preserving hierarchical relationships

Begins with high-occurrence entities:

• “Top 10” lists

• Top 10 university presses

• Mergers and acquisitions, last 8 years
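
One common way to preserve hierarchical relationships in a relational database is a self-referencing parent column queried with a recursive common table expression. The sketch below uses Python's built-in `sqlite3` module; the table layout, the column names, and the small sample of parent/subsidiary links are assumptions chosen for illustration, not the project's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES entity(id)  -- NULL for a top-level entity
);
-- Illustrative nesting only.
INSERT INTO entity (id, name, parent_id) VALUES
    (1, 'Pearson PLC', NULL),
    (2, 'Pearson Education, Inc.', 1),
    (3, 'Prentice-Hall, Inc.', 2),
    (4, 'Penguin Books', 1),
    (5, 'Puffin Books', 4),
    (6, 'Oxford University Press', NULL);
""")


def entity_with_subsidiaries(conn, root_name):
    """Return the named entity followed by everything nested beneath it."""
    rows = conn.execute(
        """
        WITH RECURSIVE tree(id, name) AS (
            SELECT id, name FROM entity WHERE name = ?
            UNION ALL
            SELECT e.id, e.name FROM entity e JOIN tree t ON e.parent_id = t.id
        )
        SELECT name FROM tree ORDER BY id
        """,
        (root_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

With this layout, recording a merger or acquisition amounts to updating one `parent_id`, and aggregate counts for a parent company can be computed over the names the query returns.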

Example: Top U.S. Publishing Entities by ISBN

ISBN Prefix   WorldCat Records   Publishing Entity
0-13          50,298             Prentice-Hall, Inc.
0-07          44,545             McGraw Hill, Inc.
0-06          44,362             HarperCollins (Firm)
0-16          40,451             United States G.P.O.
0-471         37,710             John Wiley & Sons
0-312         33,318             St. Martin's Press
0-671         31,765             Simon & Schuster, Inc.
0-02          27,602             MacMillan Publishers
0-15          18,420             Harcourt Brace & Company
0-394         18,043             Random House (Firm)
0-590         17,290             Scholastic Inc.
0-385         16,768             Doubleday and Company, Inc.
0-395         16,699             Houghton Mifflin Company
0-19          15,724             Oxford University Press
0-03          15,417             Holt, Rinehart, and Winston

Publisher Name Server: Data Captured

Data:

Publisher Name, Preferred Form

Source of Preferred Form

Former Names

Variant Forms

ISBN Prefixes

HQ City

HQ Country

Other Cities

URL

-----

Languages

Formats

Conspectus Subjects

Sources:

U.S. Library of Congress, National Authority File, 110 (Corporate Name) field

Books In Print Online (W.W. Bowker)

The International ISBN Registry (K.G. Saur)

Publishers’ Weekly Online

Hoover’s Handbook Online

Standard and Poor’s Corporate Descriptions

The Directory of Corporate Affiliations (DIALOG)

Company websites

DATA MINING
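
The fields listed on this slide could be modelled as a single record type. The following is a minimal sketch, assuming a Python dataclass; the attribute names and types are assumptions, and the sample values are taken from elsewhere in the deck except for the HQ city and country and the source label, which are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PublisherRecord:
    # Captured from authority sources and reference works
    preferred_name: str
    preferred_name_source: str
    former_names: List[str] = field(default_factory=list)
    variant_forms: List[str] = field(default_factory=list)
    isbn_prefixes: List[str] = field(default_factory=list)
    hq_city: Optional[str] = None
    hq_country: Optional[str] = None
    other_cities: List[str] = field(default_factory=list)
    url: Optional[str] = None
    # Profile attributes, stored here as percentage shares of the publisher's records
    languages: Dict[str, float] = field(default_factory=dict)
    formats: Dict[str, float] = field(default_factory=dict)
    conspectus_subjects: Dict[str, float] = field(default_factory=dict)


record = PublisherRecord(
    preferred_name="Oxford University Press",
    preferred_name_source="National Authority File",  # illustrative label
    isbn_prefixes=["0-19"],
    hq_city="Oxford",       # illustrative value
    hq_country="England",   # illustrative value
    languages={"English": 96.74, "Latin": 0.51},
    formats={"Printed Material": 89.57, "Computer File": 8.23},
)
```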

Publisher Name Server: Current Scope

More than 56,000 separate strings mapped to 1750 entities

• 8.5 million OCLC records

• 22% of these are Library of Congress records

• ~490 million holdings

Hierarchical relationships maintained

Entity-Parsing in a World of Mergers and Acquisitions

[Figure: chart of related publishing entities; the nesting shown on the slide is not recoverable from the extracted text. Entities named:]

Prentice-Hall, Inc.
Pearson Education, Inc.
Addison-Wesley Publishing Company
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
Pearson PLC
Pearson Canada
Pearson Technology Group
Copp Clark
Adobe Press
Cisco Press
Penguin Books
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Avery

Publisher Profiles within WorldCat

Oxford University Press

• 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)

Pearson PLC

• Includes 14 subsidiaries and acquisitions

• Aggregate: 291,433 records (0.27% of WorldCat)

Springer (Firm)

• 197,263 records (0.18% of WorldCat)

Reed Elsevier PLC

• Includes dozens of subsidiaries

• Aggregate: 370,029 records (0.34% of WorldCat)
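
The percentages on this slide are consistent with dividing each record count by the 108,828,533 manifestation records reported for WorldCat in July 2008. That choice of denominator is an inference from the numbers rather than something the slides state. A short check, with the function name assumed for the example:

```python
WORLDCAT_RECORDS = 108_828_533  # Manifestations (records), WorldCat: July 2008

record_counts = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}


def share_of_worldcat(records: int) -> float:
    """Percentage of all WorldCat records, rounded to two decimal places."""
    return round(100 * records / WORLDCAT_RECORDS, 2)


shares = {name: share_of_worldcat(n) for name, n in record_counts.items()}
```

Under that assumption the computed shares come out to 0.19, 0.27, 0.18 and 0.34 percent, matching the figures quoted on the slide.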

WorldCat Publisher Profiles – Top Languages

Oxford Univ. Press:

English 96.74%

Latin 0.51%

German 0.39%

Chinese 0.39%

French 0.37%

Spanish 0.28%

Afrikaans 0.14%

Middle English 0.13%

Malay 0.09%

Swahili 0.09%

Pearson PLC:

English 95.27%

Spanish 1.43%

German 1.33%

French 0.60%

Dutch 0.55%

Latin 0.26%

Malay 0.06%

Ancient Greek 0.05%

Portuguese 0.05%

Italian 0.04%
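
A profile of this kind is a frequency distribution over one attribute of a publisher's records. The sketch below shows the general idea with `collections.Counter`; the input list is a made-up stand-in for the language codes of one publisher's records, and the function name is an assumption.

```python
from collections import Counter


def profile(values):
    """Return (value, percent) pairs, largest share first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, round(100 * n / total, 2)) for value, n in counts.most_common()]


# Hypothetical language codes, one per record, for a single publisher.
languages = ["eng"] * 96 + ["lat"] * 3 + ["ger"] * 1
language_profile = profile(languages)
```

The same function applies unchanged to formats or Conspectus subjects, since it only needs one value per record.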

WorldCat Publisher Profiles – Top Languages

Springer (Firm):

English 61.25%

German 37.10%

French 1.02%

Italian 0.29%

Polish 0.13%

Czech 0.04%

Spanish 0.04%

Hungarian 0.03%

Dutch 0.02%

Danish 0.02%

Reed Elsevier PLC:

English 83.64%

French 9.34%

Dutch 2.32%

Spanish 0.95%

Italian 0.60%

Latin 0.27%

Afrikaans 0.16%

Ancient Greek 0.12%

Portuguese 0.09%

Polish 0.06%

WorldCat Publisher Profiles – Formats

Oxford University Press:

Printed Material 89.57%

Computer File 8.23%

Microform 1.39%

Sound Recording 0.50%

Video Recording 0.16%

Springer (Firm):

Printed Material 81.69%

Computer file 17.51%

Microform 0.71%

Video Recording 0.05%

Pearson PLC:

Printed Material 92.98%

Microform 2.82%

Computer File 2.15%

Video Recording 0.70%

Sound Recording 0.67%

Reed Elsevier PLC:

Printed Material 92.31%

Computer File 5.46%

Microform 1.85%

Video Recording 0.14%

WorldCat Publisher Profiles – Conspectus Divisions

Oxford Univ. Press:

Language/ Literature 27.12%

History 11.92%

Music 9.78%

Philosophy/ Religion 9.55%

Business/ Economics 6.15%

Medicine 4.36%

Law 3.85%

Sociology 3.75%

Political Science 3.58%

Biology 2.60%

Pearson PLC:

Language/ Literature 18.67%

Business/ Economics 13.30%

Computer Science 9.42%

Engineering 8.04%

History 7.59%

Mathematics 6.04%

Education 5.64%

Sociology 4.18%

Philosophy/ Religion 3.81%

Physical Sciences 2.75%

WorldCat Publisher Profiles – Conspectus Categories

Oxford Univ. Press:

English literature 10.66%

English language 5.86%

Instrumental music 3.48%

Vocal music 3.09%

Literature on music 2.26%

History – Britain 1.82%

Economic history 1.38%

American lit. 1.35%

History – S. Asia 1.30%

General history 1.29%

Pearson PLC:

English language 7.74%

Business admin. 4.62%

English literature 3.63%

Economics 2.94%

Comp. programming 2.39%

Electrical engineering 2.24%

Early childhood ed. 2.05%

Computer software 1.88%

U.S. federal law 1.80%

Computer Science 1.54%

WorldCat Publisher Profiles – Conspectus Subjects

Oxford Univ. Press:

English – modern 5.57%

English lit. – prose 2.51%

English lit. – 19th c. 2.23%

Juvenile lit. 1.06%

English lit. – poetry 1.03%

English lit. – collections 0.80%

Biographies 0.76%

English lit. – 1900-1960 0.74%

Shakespeare 0.68%

Sacred choruses 0.66%

Pearson PLC:

English – modern 7.68%

Management 2.53%

Programming 1.74%

Arithmetic 1.09%

Economic theory 1.06%

Marketing 1.06%

General algebra 1.04%

Accounting 0.97%

Juvenile lit. 0.93%

English lit. – 19th c. 0.89%

WorldCat Publisher Profiles – Conspectus Divisions

Springer (Firm):

Computer Science 16.83%

Engineering 15.12%

Mathematics 12.96%

Medicine 9.93%

Physical Sciences 9.83%

Biology 5.22%

Business/ Economics 5.13%

Health Professions 4.48%

Chemistry 3.14%

Geography 2.58%

Reed Elsevier PLC:

Language/ Literature 14.18%

Law 11.78%

Engineering 11.73%

Business/ Economics 6.82%

Medicine 6.50%

Physical Sciences 5.01%

History 4.57%

Biology 4.32%

Health Professions 3.70%

Chemistry 3.51%

WorldCat Publisher Profiles – Conspectus Categories

Springer (Firm):

Computer science 5.23%

General math 4.48%

Health professions 4.03%

Electrical engineering 3.73%

General engineering 3.25%

Mathematical analysis 3.06%

Computer software 2.37%

Comp. programming 2.34%

Probability/ Statistics 2.20%

Mech. engineering 2.17%

Reed Elsevier PLC:

English literature 5.84%

Health professions 3.40%

English language 2.79%

U.S. federal law 2.32%

General engineering 2.26%

Electrical engineering 2.10%

General law 1.70%

Industrial economics 1.65%

Business admin. 1.53%

U.S. state law 1.46%

WorldCat Publisher Profiles – Conspectus Subjects

Springer (Firm):

Health professions 3.56%

Math collections 2.76%

Computer science 1.84%

Programming 1.46%

Access/ security 1.10%

Artificial intelligence 1.03%

Mathematical stats 1.03%

Analytical physics 1.02%

Industrial management 0.99%

Engineering materials 0.90%

Reed Elsevier PLC:

English – modern 2.68%

English – prose 2.06%

Health professions 1.92%

U.S. state law 1.37%

Industrial management 1.22%

Legal periodicals 1.16%

English lit. – 1900-1960 1.15%

Engineering materials 0.86%

English fiction 0.83%

Nuclear physics 0.68%

Projected MARC coding of Authorized Forms

710 Added Entry – Corporate Name

• Add $4 for publisher name

• Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)

752 Added Entry – Hierarchical Place Name

• Add $2 FAST where place of publication matches FAST geographical subject headings
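
To make the projected coding concrete, the sketch below renders the two fields in a plain display form. It is illustrative only: the indicator values, the lowercase source codes `naf` and `fast`, the choice of `$a` and `$d` for the place name, and the sample publisher are all assumptions. The slide itself specifies only the tags and the `$4` and `$2` subfields. `pbl` is the MARC relator code for publisher.

```python
def format_field(tag, indicators, subfields):
    """Render a MARC field as: tag, indicators ('#' for blank), then '$code value' pairs."""
    shown_indicators = indicators.replace(" ", "#")
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {shown_indicators} {body}"


# 710 Added Entry, Corporate Name: publisher relator in $4, authority source in $2
publisher_entry = format_field(
    "710", "2 ",
    [("a", "Oxford University Press"), ("4", "pbl"), ("2", "naf")],
)

# 752 Added Entry, Hierarchical Place Name: place of publication, source in $2
place_entry = format_field(
    "752", "  ",
    [("a", "England"), ("d", "Oxford"), ("2", "fast")],
)
```

Under these assumptions the two strings read `710 2# $a Oxford University Press $4 pbl $2 naf` and `752 ## $a England $d Oxford $2 fast`.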

Ongoing Research

Further data mining

• Profile other aspects of publication output

• Profile other publishers

• Trends over time

• Author clusters

• Geographic holdings patterns

• Collection Analysis

Ongoing Research

Plan for long-term maintenance

• ISBN-13 compliance

• File expansion of ongoing mergers/ acquisition activities

• Deeper scaling into WorldCat (beyond ISBN)
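
For the ISBN-13 item, the standard conversion prepends 978 to the first nine digits of an ISBN-10 and recomputes the check digit with alternating weights of 1 and 3. The sketch below shows that conversion. The function names are assumptions, the sample ISBN is a commonly cited textbook example rather than one from this project, and the prefix helper simply assumes that an ISBN-10 prefix such as 0-13 carries over as 978-0-13.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens optional) to an unhyphenated ISBN-13."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError(f"expected 10 characters, got {isbn10!r}")
    core = "978" + chars[:9]  # drop the old check digit, which may be 'X'
    if not core.isdigit():
        raise ValueError(f"non-digit characters in {isbn10!r}")
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


def prefix10_to_prefix13(prefix: str) -> str:
    """Carry an ISBN-10 prefix such as '0-13' over to its 978 form."""
    return "978-" + prefix
```

For example, `isbn10_to_isbn13("0-306-40615-2")` gives `9780306406157`.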

OCLC Publisher Name Server

Project page:

http://www.oclc.org/research/projects/publisherns/

Thank You!

Questions and Discussion

Lynn Silipigni Connaway
Timothy J. Dickey