Crowdsourcing Digitization: Harnessing Workflows to Increase Output. Gretchen Gueguen, East Carolina University; Ann Hanlon, Marquette University. LITA National Forum, 2008, Cincinnati, Ohio.

Crowdsourcing Digitization: Harnessing Workflows to Increase Output

DESCRIPTION

Are the highly selective models of digital content creation satisfying user demands for increasing access to our vast collection holdings? In this era of decreasing library budgets and increasing responsibilities, is such a level of staffing possible at any but the well-funded libraries? As a recent article in the New York Times estimated, it would take 1,800 years for the National Archives to digitize its text holdings at the current rate of digitization.

Since November 2005, the University of Maryland Libraries have engaged in another model for digitization: a workflow model that harnesses the digitization already being done by archivists and other staff for requests by patrons. By “crowdsourcing” selection decisions in this way, the Libraries have built a collection of over 5,000 objects from the holdings of the University Archives and Historical Manuscripts. This model is based on two main principles:

• Selection: As one part of a programmatic approach to digitization, selections are based on user requests and added to the publicly accessible digital repository.
• Image capture: Digitization itself proceeds on the premise that creating useful surrogates is more important than digital reformatting.

The path to a successful workflow is fraught with perils, though. The presenters will discuss the aspects that have proven most effective and most difficult in the large-scale digitization workflow in place at UM. They will highlight the technical requirements chosen for images, metadata, and quality control and speak about how they were, or in some cases were not, able to achieve them. In bringing these issues to light, we hope to continue an ongoing conversation (most recently articulated at OCLC's "Digitization Matters" forum) about the purpose of digital collections and standards of digital surrogate creation, especially in the age of mass digitization projects. We hope to explore the need to harness all of the library’s expertise and resources where they can best be deployed.

Page 1: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowdsourcing Digitization
Harnessing Workflows to Increase Output

Gretchen Gueguen, East Carolina University
Ann Hanlon, Marquette University

LITA National Forum, 2008
Cincinnati, Ohio

Page 2: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

What is crowdsourcing?

• Jeff Howe, Wired Magazine, 2006
  – “distributed labor networks are using the Internet to exploit the spare processing power of millions of human brains”
  – Best example: Wikipedia…
  – Any end achieved by harnessing the wisdom and labor of crowds
  – Distributing the burden of a large endeavor

Howe, Jeff. “The Rise of Crowdsourcing.” Wired Magazine, Issue 14.06, June 2006.

Page 3: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowdsourcing Digitization

• Crowd?
  – Patrons and co-workers
• Capturing digitization for patron request
  – Selection is driven by patron request
• Centralized and decentralized staffing for digitization
• Object: Build robust digital collections
• Online collections dense enough for systematic research (not just showcases)

Page 4: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowdsourcing Digitization

• The Wisdom of Crowds
  – How the project was conceived and developed: success story
• The Madness of Crowds
  – How the project failed, why: bringing it back from the brink
• Crowd Control
  – Methods used and lessons learned
• Attracting a Crowd
  – Critical mass for the masses: why we digitize

Page 5: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

Page 6: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

• Project Background: Archives and Special Collections
  – Digital image management for archives and special collections
  – Reducing redundancy: many items requested for digitization more than once, why not track them?
• Digital Collections and Research (DCR)
  – New office to coordinate digitization efforts established
  – Establishing a digital repository
  – More ambitious than just image management

Image management = capturing patron scanning workflow to populate the new repository

Page 7: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

• Coordination between Archives and Digital Collections:
  – New metadata schema
  – New best practice guidelines
• Developing Repository
  – Fedora required development

Meanwhile, patron scanning continues to grow…

Page 8: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

• Answer: Scanning Database
  – Microsoft Access database: “stop-gap measure” while digital repository was being built
  – Corresponded to newly created XML schema and metadata requirements for repository

Page 9: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

Page 10: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

• Biggest beneficiary: University Archives
  – Receives the most scanning requests from patrons
  – Capture patron requests, as well as items scanned prior to implementation of Scanning Database
  – University celebrating 150th anniversary
    • Documentary
    • “Coffee table” book
    • Departmental histories
    • Nostalgic alumnae

Page 11: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Wisdom of Crowds

• Collections created by crowdsourcing digitization:
  – University AlbUM
  – National Trust for Historic Preservation Postcard Collection

Page 12: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

Page 13: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

• Evolution
  – Evolving standards for both metadata and imaging
• Training/Quality
• (dis)Organization
• Backlog

(Image credit: www.funnyfreepics.com)

Page 14: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

• Evolution
  – Quality of legacy scans
    • File types
    • Spatial resolutions
    • Color profiles
    • Clipping, noise, and other “problems”
    • Flawed equipment
• Training/Procedures
• (dis)Organization
• Backlog

Page 15: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

[Examples of legacy scans, labeled:]
• Rotated 90º
• Rotated 180º
• 24-bit color 300 dpi tif
• 8-bit 600 dpi tif
• 48-bit color 600 dpi tif
• Bitonal EPS
• 16-bit 300 dpi JPEG
• Indexed color 72 dpi gif
• PDF???

Page 16: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

Page 17: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

• Evolution
  – Metadata Quality
    • Lack of experience with controlled vocabularies and input standards
    • Changing metadata requirements
• Training/Procedures
• (dis)Organization
• Backlog

It’s not quite wrong…

but, it’s not quite right

Page 18: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

• Evolution
• Training/Procedures
  – No standard guidelines for scanning procedures
  – No quality control procedures for images or metadata
  – No one to set them up anyway…
• (dis)Organization
• Backlog

Page 19: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

Page 20: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

Page 21: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

• Evolution
• Training/Procedures
• (dis)Organization
  – Does everything fit in a “collection”?
• Backlog

Page 22: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

• Evolution
• Training/Procedures
• (dis)Organization
• Backlog
  – Robust metadata standard to enable repurposing and “sharability”
  – Could take 10x more time to do metadata than scanning
  – Volume of scanning didn’t leave much time for metadata

Page 23: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

The Madness of Crowds

Page 24: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

Page 25: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

1. Create Documentation
2. “Teachable” standard
3. Responsibility
4. Quality
5. Divide and Conquer?!?

Page 26: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

1. Create Documentation
2. TEACH it
   1. resolution
   2. color
   3. straightness and placement
   4. reference points (targets)
   5. noise
   6. file format
3. Responsibility
4. Quality: Live it, Learn it, Love it
5. Divide and Conquer
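
A teachable standard is easier to apply when its measurable parts can be checked mechanically. The following is a minimal sketch, not the presenters' tooling: it checks only resolution, color bit depth, and file format, and it leaves straightness, targets, and noise to a human reviewer. The class and function names, and the values in STANDARD (TIFF only, at least 300 dpi, 8-bit or 24-bit), are assumptions made up for illustration. The sample scans echo the kinds of legacy files listed on the earlier slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanStandard:
    """Hypothetical local standard; values are illustrative, not from the slides."""
    file_formats: frozenset
    min_dpi: int
    bit_depths: frozenset


@dataclass(frozen=True)
class Scan:
    filename: str
    file_format: str
    dpi: int
    bit_depth: int


def check_scan(scan, standard):
    """Return a list of problems; an empty list means the scan conforms."""
    problems = []
    if scan.file_format.lower() not in standard.file_formats:
        problems.append(f"file format {scan.file_format} is not accepted")
    if scan.dpi < standard.min_dpi:
        problems.append(f"resolution {scan.dpi} dpi is below {standard.min_dpi} dpi")
    if scan.bit_depth not in standard.bit_depths:
        problems.append(f"bit depth {scan.bit_depth} is not accepted")
    return problems


# Assumed standard: TIFF only, at least 300 dpi, 8-bit or 24-bit.
STANDARD = ScanStandard(frozenset({"tif"}), 300, frozenset({8, 24}))

legacy = [
    Scan("a.tif", "tif", 300, 24),
    Scan("b.jpg", "jpeg", 300, 16),
    Scan("c.gif", "gif", 72, 8),
]

# Map each filename to its list of problems.
report = {scan.filename: check_scan(scan, STANDARD) for scan in legacy}

for name, problems in report.items():
    print(name, "OK" if not problems else "; ".join(problems))
```

Returning a list of problems rather than a single pass/fail flag lets the same check double as training feedback for whoever made the scan.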

Page 27: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

[Diagram after Puglia, 2007. Recoverable labels:]
• Imaging Environment: Defined
• Image State:
  – RAW
  – Input Referred: looks towards sensor
  – Original Referred: defined relationship between original and digital version
  – Output Referred: looks towards output
  – Prepped for a specific output
• Current Practice
• Emerging Practice
• More technical metadata is needed
• Should be able to get by with less technical metadata

Page 28: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

1. Create documentation
2. TEACH it!
3. Quality: Live it, Learn it, Love it
   – Have curatorial staff check for accuracy and completeness
   – DCR staff follow up with a check of a statistically significant portion for style and consistency
   – Eventually, give curatorial staff the ability to make corrections as they find them using the web-based administrative form
4. Responsibility
5. Divide and conquer?!?
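
The slides do not say how the "statistically significant portion" was chosen. As one possible reading, the sketch below sizes a random sample with Cochran's formula and a finite population correction, then draws record identifiers for review. The formula, the 95% confidence level (z = 1.96), the 5% margin of error, and the function names are all assumptions swapped in for illustration, not the method used at UM.

```python
import math
import random


def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample size with finite population correction (assumed method)."""
    if population <= 0:
        return 0
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    # Never ask for more records than exist.
    return min(population, math.ceil(n))


def pick_for_review(record_ids, seed=None):
    """Draw a simple random sample of record identifiers for a style check."""
    ids = list(record_ids)
    rng = random.Random(seed)  # a seed makes the draw repeatable for audits
    return sorted(rng.sample(ids, sample_size(len(ids))))


# Hypothetical repository of 5,000 records identified by integers.
to_review = pick_for_review(range(5000), seed=1)
print(len(to_review), "records selected for review")
```

With these assumed parameters the sample stays in the hundreds even for thousands of records, which is what makes a follow-up check by a small central office feasible.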

Page 29: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

1. Documentation
2. “Teachable” standard
3. Quality: Live it, Learn it, Love it
4. Responsibility
   – Someone has to have some
   – But it doesn’t have to be an entire job
5. Divide and Conquer

Page 30: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

1. Create documentation
2. TEACH it!
3. Quality: Live it, Learn it, Love it
4. Responsibility
5. Divide and conquer?!?
   – Stub record created at request time; Cataloging enhances
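
The divide-and-conquer step can be pictured as a two-stage record: a minimal stub captured when the patron request comes in, and a later pass in which cataloging adds descriptive metadata. The sketch below is only an illustration of that split. The field names, the status values "stub" and "cataloged", and the sample identifier are hypothetical and do not reflect the actual schema of the UM repository.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Record:
    # Fields known at request time (hypothetical names).
    identifier: str
    collection: str
    requested_on: date
    # Fields filled in later by cataloging.
    title: str = ""
    subjects: tuple = ()
    status: str = "stub"


def create_stub(identifier, collection, requested_on):
    """Capture only what is known when the scan is requested."""
    return Record(identifier, collection, requested_on)


def enhance(record, title, subjects):
    """Return a new, fuller record; the original stub is left unchanged."""
    if not title.strip():
        raise ValueError("a title is required to complete a record")
    return replace(
        record,
        title=title.strip(),
        subjects=tuple(subjects),
        status="cataloged",
    )


stub = create_stub("img-0001", "University Archives", date(2008, 10, 1))
full = enhance(stub, " Campus aerial view ", ["Aerial photographs"])
print(stub.status, "->", full.status)
```

Keeping the stub immutable and producing a new record on enhancement means the scanning staff and the catalogers never overwrite each other's work.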

Page 31: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

1. Create documentation
2. TEACH it!
3. Quality: Live it, Learn it, Love it
4. Responsibility
5. Divide and conquer
6. Give up
   • Less control, more power

Page 32: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

• Would you want to try this?
  – Give yourself room to evolve and change through the project
  – Don’t feel like every image is a precious snowflake
  – More than any single technique, it’s the philosophy of crowdsourcing that’s important

Page 33: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

• Would you want to try this?
  – Don’t feel like every image is a precious snowflake

Access to a low-quality scan…

…is still better than no access at all.

Page 34: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

• Would you want to try this?

– More than any single technique, it’s the philosophy of crowdsourcing that’s important

Page 35: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowd Control

Page 36: Crowdsourcing Digitization: Harnessing Workflows to Increase Output
Page 37: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Attracting a Crowd

Page 38: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Attracting a Crowd

• Letting Go
  – “Letting go” creates efficiencies
  – Looking at expertise across the Libraries
  – Distribute the burden

Move away from “trophy” collections toward online Research Collections

Page 39: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Attracting a Crowd

• Distributed Problem-solving
  – Ideas from Archives:
    • Organizing repository by subject rather than by collection
    • Dabbling in folder-level description (and digitization) rather than just item-level
• Neutral Collection-building

Erway, Ricky, and Jennifer Schaffner. 2007. “Gearing Up to Get Into the Flow.” Report produced by OCLC Programs and Research (formerly RLG).

Page 40: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Attracting a Crowd

• Distributed Problem-solving
  – Ideas from Archives:
    • Using “stub records” from patron request forms
    • Dabbling in folder-level description (and digitization) rather than just item-level
• “Neutral” Collection-building
  – Wikipedia-style collection-building
  – Building a collection with wide range

Page 41: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Attracting a Crowd

• Mass digitization
  – Google projects:
    • Books
    • Newspapers
• Mass decision-making
  – Instead of item-level decision-making

Page 42: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Attracting a Crowd

• Making Digitization a Core Function of the Library
  – Mission Statements come to life!
  – Organizing around digitization: very little has really been done yet

Why? For researchers

• “Fringe activities” need to become core investments
  – Metadata creation
  – Digitization

Council on Library and Information Resources (CLIR). No Brief Candle: Reconceiving Research Libraries for the 21st Century, 2008.

Page 43: Crowdsourcing Digitization: Harnessing Workflows to Increase Output

Crowdsourcing Digitization

THANKS!

Access these slides at: http://www.personal.ecu.edu/presentations/Crowdsourcing.ppt

Or: http://www.slideshare.net

Gretchen Gueguen
[email protected]
East Carolina University
Greenville, North Carolina

Ann Hanlon
[email protected]
Marquette University
Milwaukee, Wisconsin