Charleston Conference
7 November 2008
Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC Research
Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC Research
Data Mining, Advanced Collection Analysis, and
Publisher Profiles: An Update on the OCLC
Publisher Name Authority File
Overall Research Goals
To Build a Database that Will:
Identify
• Authoritative strings for publisher names
• Common variants for names and locations
• Hierarchical references indicating relationships and nesting of subsidiaries
• Definitions of publishing entities
Overall Research Goals
To Build a Database that Will:
Produce
• Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers
Conform
• to international authority and standards practice, and
• inter-operate with other OCLC products
Issues & Challenges
Database Quality:
Historical Practices
• “…the shortest form in which it can be understood.” [AACR2 2004]
• Different versions of cataloging rules
• Abbreviations
Errors and misspellings
Local Practices
Method: Data Mining in an “Aggregate Collection”
Data Mining and Analysis of WorldCat:
“…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.”
Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115 at 107.
WorldCat: July 2008
Total holdings: 1,292,763,300
Manifestations (records): 108,828,533
Works: 84,096,107
Digital Items: 3,182,550
Institutions: 69,000
Physical Items: ~1.2 billion
Global Origins of WorldCat Materials
US: 28%
UK: 8%
Canada: 3%
Rest of World: 27%
Unknown: 17%
France: 4%
Germany: 10%
Global Origins of WorldCat Materials
Content Languages: 478
49% of WorldCat is non-English
Top 5 non-English:
German: 12 million
French: 6.1 million
Spanish: 3.5 million
Dutch: 2.6 million
Japanese: 2.4 million
Materials w/non-US origins:
57.9 million (55%)
Top 5:
Germany: 10.0 million
UK: 8.8 million
France: 4.2 million
Netherlands: 2.9 million
Canada: 2.9 million
Non-English Metadata Language:
28 million (66 languages)
Top 5:
German: 11 million
Dutch: 5.0 million
Swedish: 1.9 million
French: 1.8 million
Finnish: 0.7 million
OCLC Publisher Name Server
Publisher Name Server: Objectives
Resolve, in support of data mining and WorldCat quality:
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Complement Collection Analysis Service
• Librarians & Publishers
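The second resolution task, mapping variant publisher names to a preferred form, can be illustrated with a minimal sketch. The normalization rules and the small lookup table are assumptions made for illustration only; the preferred forms come from this deck, but the variant strings are invented and this is not presented as the Publisher Name Server's actual method.

```python
import re
import unicodedata

# Hypothetical lookup: normalized variant string -> preferred form.
# Preferred forms appear in this deck; the variants are invented examples.
VARIANTS = {
    "prentice hall": "Prentice-Hall, Inc.",
    "prentice hall inc": "Prentice-Hall, Inc.",
    "oxford university press": "Oxford University Press",
    "oxford univ press": "Oxford University Press",
    "j wiley": "John Wiley & Sons",
    "john wiley and sons": "John Wiley & Sons",
}


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def preferred_form(name):
    """Return the preferred form for a publisher string, or None if unknown."""
    return VARIANTS.get(normalize(name))
```

A string that does not normalize to a known variant returns None, which in a workflow like the one described here would presumably be left for hand parsing.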
Publisher Name Server: Objectives
Capture and profile attributes of individual publishers:
• Location(s)
• Language(s) of materials published
• Genre(s)/format(s)
• Dominant subject domain(s)
• Parent company and subsidiaries
Publisher Name Server: Methodology
Programmatically cluster publishers’ records using ISBN prefixes
• Data clustering
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
Hand parse the entities and resolve ISBN prefixes
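The clustering step can be sketched roughly as follows. This is a simplified illustration, not the project's code: it assumes ISBN-10s that are already hyphenated, so the prefix (group plus publisher element) is simply the first two segments. Unhyphenated ISBNs would need the official range tables to locate the publisher element, which this sketch omits. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def isbn_prefix(isbn):
    """Return 'group-publisher' from a hyphenated ISBN-10, else None.

    Example: '0-13-110362-8' -> '0-13'.
    """
    parts = isbn.strip().split("-")
    if len(parts) != 4:
        return None
    return f"{parts[0]}-{parts[1]}"


def cluster_by_prefix(records):
    """Group (isbn, publisher_string) pairs by ISBN prefix."""
    clusters = defaultdict(set)
    for isbn, publisher in records:
        prefix = isbn_prefix(isbn)
        if prefix is not None:
            clusters[prefix].add(publisher)
    return dict(clusters)


# Illustrative sample records; the publisher strings imitate variant forms.
sample = [
    ("0-13-110362-8", "Prentice-Hall"),
    ("0-13-110163-3", "Prentice Hall, Inc."),
    ("0-19-852663-6", "Oxford University Press"),
    ("0-19-853453-1", "Oxford Univ. Press"),
    ("0198534531", "Clarendon Press"),  # unhyphenated: skipped by this sketch
]
clusters = cluster_by_prefix(sample)
```

Each resulting cluster collects the publisher strings seen under one prefix, which is the raw material for the hand-parsing step.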
Publisher Name Server: Database
1750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
• “Top 10” lists
• Top 10 university presses
• Mergers and acquisitions, last 8 years
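One common way to preserve hierarchical relationships in a relational database is a self-referencing parent column, queried with a recursive common table expression. The sketch below uses SQLite for illustration; the table name, column names, and query are assumptions, not the project's schema, and the parent links shown are a small assumed subset of the Pearson example in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES entity(id)  -- NULL for a top-level entity
);
INSERT INTO entity (id, name, parent_id) VALUES
    (1, 'Pearson PLC', NULL),
    (2, 'Pearson Education, Inc.', 1),
    (3, 'Prentice-Hall, Inc.', 2),
    (4, 'Penguin Books', 1),
    (5, 'Puffin Books', 4);
""")


def descendants(conn, root_name):
    """Return (name, depth) for every entity nested under root_name."""
    rows = conn.execute(
        """
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 0 FROM entity WHERE name = ?
            UNION ALL
            SELECT e.id, e.name, tree.depth + 1
            FROM entity AS e JOIN tree ON e.parent_id = tree.id
        )
        SELECT name, depth FROM tree WHERE depth > 0 ORDER BY depth, name
        """,
        (root_name,),
    )
    return rows.fetchall()
```

Calling `descendants(conn, "Pearson PLC")` walks the whole subtree, which is the kind of query needed to aggregate a parent company's records across its subsidiaries.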
Example: Top U.S. Publishing Entities by ISBN
ISBN Prefix   WorldCat Records   Publishing Entity
0-13          50,298             Prentice-Hall, Inc.
0-07          44,545             McGraw Hill, Inc.
0-06          44,362             HarperCollins (Firm)
0-16          40,451             United States G.P.O.
0-471         37,710             John Wiley & Sons
0-312         33,318             St. Martin's Press
0-671         31,765             Simon & Schuster, Inc.
0-02          27,602             MacMillan Publishers
0-15          18,420             Harcourt Brace & Company
0-394         18,043             Random House (Firm)
0-590         17,290             Scholastic Inc.
0-385         16,768             Doubleday and Company, Inc.
0-395         16,699             Houghton Mifflin Company
0-19          15,724             Oxford University Press
0-03          15,417             Holt, Rinehart, and Winston
Publisher Name Server: Data Captured
Data:
Publisher Name, Preferred Form
Source of Preferred Form
Former Names
Variant Forms
ISBN Prefixes
HQ City
HQ Country
Other Cities
URL
-----
Languages
Formats
Conspectus Subjects
Sources:
U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
Books In Print Online (R.R. Bowker)
The International ISBN Registry (K.G. Saur)
Publishers Weekly Online
Hoover’s Handbook Online
Standard and Poor’s Corporate Descriptions
The Directory of Corporate Affiliations (DIALOG)
Company websites
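Data mining (presumably the source of the Languages, Formats, and Conspectus Subjects attributes listed below the divider)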
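The captured attributes could be modeled as a simple record type. The sketch below is only one possible shape: the class name, field names, and types are assumptions derived from the list on this slide, and the example values are illustrative rather than taken from the actual file.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublisherProfile:
    # Attributes listed above the divider on the slide
    preferred_name: str
    preferred_name_source: str
    former_names: List[str] = field(default_factory=list)
    variant_forms: List[str] = field(default_factory=list)
    isbn_prefixes: List[str] = field(default_factory=list)
    hq_city: Optional[str] = None
    hq_country: Optional[str] = None
    other_cities: List[str] = field(default_factory=list)
    url: Optional[str] = None
    # Attributes listed below the divider on the slide
    languages: List[str] = field(default_factory=list)
    formats: List[str] = field(default_factory=list)
    conspectus_subjects: List[str] = field(default_factory=list)


# Illustrative instance; the source label is a placeholder, not a real value.
example = PublisherProfile(
    preferred_name="Oxford University Press",
    preferred_name_source="(source not specified in this sketch)",
    variant_forms=["Oxford Univ. Press"],
    isbn_prefixes=["0-19"],
    languages=["English", "Latin", "German"],
    formats=["Printed Material", "Computer File"],
)
```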
Publisher Name Server: Current Scope
More than 56,000 separate strings mapped to 1750 entities
• 8.5 million OCLC records
• 22% of these are Library of Congress records
• ~490 million holdings
Hierarchical relationships maintained
Entity-Parsing in a World of Mergers and Acquisitions
Pearson PLC
• Pearson Education, Inc.
    • Prentice-Hall, Inc.
    • Addison-Wesley Publishing Company
    • Allyn and Bacon
    • Dominie Press
    • Benjamin/Cummings Publishing Company
    • Scott, Foresman and Company
    • HarperCollins Educational Publishers
    • Longmans, Green, and Co.
• Pearson Canada
    • Copp Clark
• Pearson Technology Group
    • Adobe Press
    • Cisco Press
• Penguin Books
    • Allen Lane
    • Ladybird Books
    • Riverhead Books
    • Puffin Books
    • Putnam Books
    • Berkley Publishing Group
    • Avery
Publisher Profiles within WorldCat
Oxford University Press
• 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
Pearson PLC
• Includes 14 subsidiaries and acquisitions
• Aggregate: 291,433 records (0.27% of WorldCat)
Springer (Firm)
• 197,263 records (0.18% of WorldCat)
Reed Elsevier PLC
• Includes dozens of subsidiaries
• Aggregate: 370,029 records (0.34% of WorldCat)
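The percentages above are consistent with dividing each publisher's record count by the July 2008 total of 108,828,533 WorldCat records. That choice of denominator is an inference from the figures in this deck, not something the slide states. A short check:

```python
# Manifestations (records) reported for WorldCat, July 2008.
WORLDCAT_RECORDS = 108_828_533

# Record counts as given on the slide.
publisher_records = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}


def share_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Return a record count as a percentage of the total, to two decimals."""
    return round(100 * records / total, 2)


shares = {name: share_of_worldcat(n) for name, n in publisher_records.items()}
```

The computed shares come out to 0.19, 0.27, 0.18, and 0.34 percent respectively, matching the slide.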
WorldCat Publisher Profiles – Top Languages
Oxford Univ. Press:
English 96.74%
Latin 0.51%
German 0.39%
Chinese 0.39%
French 0.37%
Spanish 0.28%
Afrikaans 0.14%
Middle English 0.13%
Malay 0.09%
Swahili 0.09%
Pearson PLC:
English 95.27%
Spanish 1.43%
German 1.33%
French 0.60%
Dutch 0.55%
Latin 0.26%
Malay 0.06%
Ancient Greek 0.05%
Portuguese 0.05%
Italian 0.04%
WorldCat Publisher Profiles – Top Languages
Springer (Firm):
English 61.25%
German 37.10%
French 1.02%
Italian 0.29%
Polish 0.13%
Czech 0.04%
Spanish 0.04%
Hungarian 0.03%
Dutch 0.02%
Danish 0.02%
Reed Elsevier PLC:
English 83.64%
French 9.34%
Dutch 2.32%
Spanish 0.95%
Italian 0.60%
Latin 0.27%
Afrikaans 0.16%
Ancient Greek 0.12%
Portuguese 0.09%
Polish 0.06%
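Language profiles of this kind can be produced by tallying one language code per record and converting the counts to percentages. The sketch below is a hedged illustration: the function name is hypothetical, the input is assumed to be a flat list of language codes for one publisher's records, and the sample data is invented.

```python
from collections import Counter


def language_profile(language_codes, top=10):
    """Return the top languages as (code, percent of records) pairs."""
    counts = Counter(language_codes)
    total = sum(counts.values())
    return [
        (lang, round(100 * n / total, 2))
        for lang, n in counts.most_common(top)
    ]


# Invented sample: 100 records with three language codes.
sample = ["eng"] * 61 + ["ger"] * 37 + ["fre"] * 2
profile = language_profile(sample)
```

The same tally-and-normalize pattern would apply to formats or subject categories, given one value per record.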
WorldCat Publisher Profiles – Formats
Oxford University Press:
Printed Material 89.57%
Computer File 8.23%
Microform 1.39%
Sound Recording 0.50%
Video Recording 0.16%
Springer (Firm):
Printed Material 81.69%
Computer File 17.51%
Microform 0.71%
Video Recording 0.05%
Pearson PLC:
Printed Material 92.98%
Microform 2.82%
Computer File 2.15%
Video Recording 0.70%
Sound Recording 0.67%
Reed Elsevier PLC:
Printed Material 92.31%
Computer File 5.46%
Microform 1.85%
Video Recording 0.14%
WorldCat Publisher Profiles – Conspectus Divisions
Oxford Univ. Press:
Language/ Literature 27.12%
History 11.92%
Music 9.78%
Philosophy/ Religion 9.55%
Business/ Economics 6.15%
Medicine 4.36%
Law 3.85%
Sociology 3.75%
Political Science 3.58%
Biology 2.60%
Pearson PLC:
Language/ Literature 18.67%
Business/ Economics 13.30%
Computer Science 9.42%
Engineering 8.04%
History 7.59%
Mathematics 6.04%
Education 5.64%
Sociology 4.18%
Philosophy/ Religion 3.81%
Physical Sciences 2.75%
WorldCat Publisher Profiles – Conspectus Categories
Oxford Univ. Press:
English literature 10.66%
English language 5.86%
Instrumental music 3.48%
Vocal music 3.09%
Literature on music 2.26%
History – Britain 1.82%
Economic history 1.38%
American lit. 1.35%
History – S. Asia 1.30%
General history 1.29%
Pearson PLC:
English language 7.74%
Business admin. 4.62%
English literature 3.63%
Economics 2.94%
Comp. programming 2.39%
Electrical engineering 2.24%
Early childhood ed. 2.05%
Computer software 1.88%
U.S. federal law 1.80%
Computer Science 1.54%
WorldCat Publisher Profiles – Conspectus Subjects
Oxford Univ. Press:
English – modern 5.57%
English lit. – prose 2.51%
English lit. – 19th c. 2.23%
Juvenile lit. 1.06%
English lit. – poetry 1.03%
English lit. – collections 0.80%
Biographies 0.76%
English lit. – 1900-1960 0.74%
Shakespeare 0.68%
Sacred choruses 0.66%
Pearson PLC:
English – modern 7.68%
Management 2.53%
Programming 1.74%
Arithmetic 1.09%
Economic theory 1.06%
Marketing 1.06%
General algebra 1.04%
Accounting 0.97%
Juvenile lit. 0.93%
English lit. – 19th c. 0.89%
WorldCat Publisher Profiles – Conspectus Divisions
Springer (Firm):
Computer Science 16.83%
Engineering 15.12%
Mathematics 12.96%
Medicine 9.93%
Physical Sciences 9.83%
Biology 5.22%
Business/ Economics 5.13%
Health Professions 4.48%
Chemistry 3.14%
Geography 2.58%
Reed Elsevier PLC:
Language/ Literature 14.18%
Law 11.78%
Engineering 11.73%
Business/ Economics 6.82%
Medicine 6.50%
Physical Sciences 5.01%
History 4.57%
Biology 4.32%
Health Professions 3.70%
Chemistry 3.51%
WorldCat Publisher Profiles – Conspectus Categories
Springer (Firm):
Computer science 5.23%
General math 4.48%
Health professions 4.03%
Electrical engineering 3.73%
General engineering 3.25%
Mathematical analysis 3.06%
Computer software 2.37%
Comp. programming 2.34%
Probability/ Statistics 2.20%
Mech. engineering 2.17%
Reed Elsevier PLC:
English literature 5.84%
Health professions 3.40%
English language 2.79%
U.S. federal law 2.32%
General engineering 2.26%
Electrical engineering 2.10%
General law 1.70%
Industrial economics 1.65%
Business admin. 1.53%
U.S. state law 1.46%
WorldCat Publisher Profiles – Conspectus Subjects
Springer (Firm):
Health professions 3.56%
Math collections 2.76%
Computer science 1.84%
Programming 1.46%
Access/ security 1.10%
Artificial intelligence 1.03%
Mathematical stats 1.03%
Analytical physics 1.02%
Industrial management 0.99%
Engineering materials 0.90%
Reed Elsevier PLC:
English – modern 2.68%
English – prose 2.06%
Health professions 1.92%
U.S. state law 1.37%
Industrial management 1.22%
Legal periodicals 1.16%
English lit. – 1900-1960 1.15%
Engineering materials 0.86%
English fiction 0.83%
Nuclear physics 0.68%
Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name
• Add $4 for publisher name
• Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
752 Added Entry – Hierarchical Place Name
• Add $2 FAST where place of publication matches FAST geographical subject headings
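To make the projected coding concrete, the sketch below formats two example fields as display strings. The subfield codes $4 and $2 follow the slide. The values `pbl` (publisher relator code), `naf`, and `fast` (source codes), the indicator placeholders, and the 752 subfields $a and $d are assumptions added for illustration, as is the helper function, which is not part of any MARC library.

```python
def format_field(tag, indicators, subfields):
    """Render a MARC field as a display string: tag, indicators, $code value."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {indicators} {body}"


# 710: corporate name added entry, flagged as publisher, heading from NAF.
# Indicator values here are placeholders ('#' stands for blank).
f710 = format_field(
    "710", "2#",
    [("a", "Oxford University Press"), ("4", "pbl"), ("2", "naf")],
)

# 752: hierarchical place name matched to a FAST geographic heading.
f752 = format_field(
    "752", "##",
    [("a", "England"), ("d", "Oxford"), ("2", "fast")],
)
```

Printed, these read `710 2# $a Oxford University Press $4 pbl $2 naf` and `752 ## $a England $d Oxford $2 fast`.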
Ongoing Research
Further data mining
• Profile other aspects of publication output
• Profile other publishers
• Trends over time
• Author clusters
• Geographic holdings patterns
• Collection Analysis
Ongoing Research
Plan for long-term maintenance
• ISBN-13 compliance
• File expansion to reflect ongoing merger and acquisition activity
• Deeper scaling into WorldCat (beyond ISBN)
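ISBN-13 compliance matters here because the file is keyed on ISBN-10 prefixes. The standard conversion prepends 978 to the first nine digits and recomputes the check digit with alternating weights of 1 and 3. The sketch below shows that conversion; the function name is hypothetical, and how the project itself plans to handle ISBN-13 is not stated in the slides.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 (hyphens or spaces allowed) to an unhyphenated ISBN-13."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError("expected a 10-character ISBN")
    # Drop the old check digit and add the 978 prefix.
    core = "978" + chars[:9]
    if not core.isdigit():
        raise ValueError("ISBN body must be numeric")
    # Weights alternate 1, 3, 1, 3, ... across the twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

Under this conversion, the Oxford University Press prefix 0-19 becomes the leading digits 978-0-19 in an ISBN-13, so prefix lookups would need to account for the added element.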
OCLC Publisher Name Server
Project page:
http://www.oclc.org/research/projects/publisherns/
Thank You!
Questions and Discussion
Lynn Silipigni Connaway [email protected]
Timothy J. Dickey [email protected]