
20121105 no tempest in my teapot [dlf forum denver]


Page 1: 20121105 no tempest in my teapot [dlf forum denver]

No tempest in my teapot: Analysis of Crowdsourced Data and User Experiences at the California Digital Newspaper Collection

Brian Geiger
Director, Center for Bibliographical Studies and Research
California Digital Newspaper Collection

Frederick Zarndt
Chair, IFLA Newspapers Section

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Page 2: 20121105 no tempest in my teapot [dlf forum denver]

Crowds

Page 3: 20121105 no tempest in my teapot [dlf forum denver]

The Wisdom of Crowds

In 2004 James Surowiecki published “The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations”. In it he asserts that

a crowd of people who are diverse, independent, and decentralized usually makes better judgements or decisions than any single person

Page 4: 20121105 no tempest in my teapot [dlf forum denver]

“crowdsourcing”

was coined by Jeff Howe in “The rise of crowdsourcing” published in Wired magazine June 2006.

Page 5: 20121105 no tempest in my teapot [dlf forum denver]

A Google advanced search for “crowdsourcing” from 1-Jun-2006, the date of publication of Jeff Howe’s Wired magazine article, to 1-Jun-2007 gives 44,600 hits.

A date range of 1-Jun-2011 to 1-Jun-2012 gives 2,680,000 hits.

Searches used the Internet Archive’s Wayback Machine.

Page 6: 20121105 no tempest in my teapot [dlf forum denver]

Crowdsourcing is a process that involves outsourcing tasks to a distributed group of people. ... the difference between crowdsourcing and ordinary outsourcing is that a task or problem is outsourced to an undefined public rather than a specific body, such as paid employees.

Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed June 1, 2012)

Page 7: 20121105 no tempest in my teapot [dlf forum denver]

Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit. The user will receive the satisfaction of a given type of need, be it economic, social recognition, self-esteem, or the development of individual skills, while the crowdsourcer will obtain and utilize to their advantage that what the user has brought to the venture, whose form will depend on the type of activity undertaken.

Enrique Estellés-Arolas and Fernando González-Ladrón-de-Guevara. Towards an integrated crowdsourcing definition. Journal of Information Science XX(X). 2012. pp. 1-14.

Page 8: 20121105 no tempest in my teapot [dlf forum denver]

crowd*

crowdfunding

citizen science

crowdcasting

crowdsourcing

crowdvoting

crowdcollaboration

Page 9: 20121105 no tempest in my teapot [dlf forum denver]

What is Alexa?

• Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.

• Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.

• Alexa’s operations include archiving webpages as they are crawled. This database served as the basis for the creation of the Internet Archive, accessible through the Wayback Machine.

• Alexa continually crawls all publicly available websites to create a series of snapshots of the web.

• Alexa gathers information from a variety of sources to provide key statistics about each site on the web, for example Traffic Rank, number of PageViews, site Speed, and Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000 worldwide).

Page 10: 20121105 no tempest in my teapot [dlf forum denver]

definitions

• A PageView is a request for a file whose type is defined as a page.

• A Unique Visitor is a uniquely identified client generating requests on the web server or viewing pages within a defined time period (i.e. day, week or month). A Unique Visitor counts once within the timescale.

• A Visit is a series of page requests from the same uniquely identified client with a time of no more than 30 minutes between each page request.

• Bounce Rate is the percentage of visits where the visitor enters and exits at the same page without visiting any other pages on the site in between.

• World | Country Rank is a function of the average daily unique visits and the number of unique pages requested.

definitions adapted from Wikipedia http://en.wikipedia.org/wiki/Web_analytics
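As a rough illustration of how the Visit, Unique Visitor and Bounce Rate definitions above fit together, the following sketch groups page requests into visits using the 30-minute rule and then counts single-page visits as bounces. The request format (client id, timestamp, page) and the function names are assumptions made for this example; they are not Alexa's actual method or data model.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # from the Visit definition above


def split_into_visits(requests):
    """Group (client_id, timestamp, page) page requests into visits.

    A new visit starts whenever the same client has been idle for more
    than SESSION_GAP since its previous page request.
    """
    visits = []      # each visit is a list of pages, in request order
    open_visit = {}  # client_id -> (page list of latest visit, last request time)
    for client, ts, page in sorted(requests, key=lambda r: r[1]):
        current = open_visit.get(client)
        if current is None or ts - current[1] > SESSION_GAP:
            pages = []
            visits.append(pages)
        else:
            pages = current[0]
        pages.append(page)
        open_visit[client] = (pages, ts)
    return visits


def unique_visitors(requests):
    """Number of distinct clients seen in the given period."""
    return len({client for client, _, _ in requests})


def bounce_rate(visits):
    """Percentage of visits that consist of a single page view."""
    if not visits:
        return 0.0
    bounces = sum(1 for pages in visits if len(pages) == 1)
    return 100.0 * bounces / len(visits)
```

Treating a one-page visit as a bounce is a simplification of "enters and exits at the same page"; a real analytics package may apply further rules.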

Page 11: 20121105 no tempest in my teapot [dlf forum denver]

crowdsourcing

Amazon Mechanical Turk was launched Nov 2005

Alexa global rank of Amazon Mechanical Turk (13-Jun-2012): 6,022

Page 12: 20121105 no tempest in my teapot [dlf forum denver]

crowdsourcing

Each day 200,000,000 reCAPTCHAs are solved by humans around the world

Page 13: 20121105 no tempest in my teapot [dlf forum denver]

crowdvoting

Iowa Electronic Market was 1st launched in 1995

Alexa global traffic rank of Iowa Electronic Market (6-Aug-2012): 11,290

Alexa US traffic rank of Iowa Electronic Market (6-Aug-2012): 3,923

Page 14: 20121105 no tempest in my teapot [dlf forum denver]

citizen science

Galaxy Zoo was 1st launched July 2007

Alexa global traffic rank of Galaxy Zoo (13-Jun-2012): 557,766

Page 15: 20121105 no tempest in my teapot [dlf forum denver]

crowdfunding

Kickstarter was 1st launched in 2008

Alexa global traffic rank of Kickstarter (6-Aug-2012): 752

27,528 projects successfully funded with more than USD $254,000,000

Page 16: 20121105 no tempest in my teapot [dlf forum denver]

crowdcollaboration

Page 17: 20121105 no tempest in my teapot [dlf forum denver]

Wikipedia

• Began 2001

• Now in 285 languages

• 3,900,000+ articles in English, 1,400,000+ in German, 1,250,000+ in French, 1,050,000 in Dutch

• 40 wikipedia languages with more than 100,000 articles

• 112 wikipedia languages with more than 10,000 articles

• 400,000,000 unique visitors per month

• 85,000 active contributors

• Alexa global traffic rank: #6 in worldwide web traffic

Page 18: 20121105 no tempest in my teapot [dlf forum denver]
Page 19: 20121105 no tempest in my teapot [dlf forum denver]

FamilySearch Indexing was 1st launched (beta) 2004

Alexa global / country traffic rank of FamilySearch (13-Jun-2012): 4,352 / 1,357

Page 20: 20121105 no tempest in my teapot [dlf forum denver]

• Started (beta) 2004

• More than 780,000 worldwide registered volunteers from ~25 countries index records relevant to family history

• Approximately 100,000 active volunteers each month

• UI in Chinese, English, German, French, Italian, Japanese, Korean, Portuguese, and Russian

• Blind double-key entry with arbitration / reconciliation

• More than 1,500,088,741 records indexed (July 2012)

• Accuracy typically > 99.95%
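The slides say only that FamilySearch uses "blind double-key entry with arbitration / reconciliation". As a minimal sketch of that general idea, and not of FamilySearch's actual workflow, the code below compares two independently keyed versions of a record field by field, accepts values that agree, and hands disagreements to an arbitrator. The function names, the whitespace trimming, and the `arbitrate` callback are all hypothetical.

```python
def reconcile_field(key_a, key_b, arbitrate):
    """Accept a value keyed twice if both versions agree, else arbitrate."""
    a, b = key_a.strip(), key_b.strip()
    if a == b:
        return a
    # The two indexers disagree: a third party decides.
    return arbitrate(a, b)


def reconcile_record(record_a, record_b, arbitrate):
    """Reconcile two independently keyed records field by field.

    Both records are assumed to be dicts with the same field names.
    Returns the accepted record and the list of fields that needed
    arbitration.
    """
    accepted, disputed = {}, []
    for field in record_a:
        if record_a[field].strip() != record_b[field].strip():
            disputed.append(field)
        accepted[field] = reconcile_field(
            record_a[field], record_b[field], arbitrate
        )
    return accepted, disputed
```

The keying is "blind" in the sense that neither indexer sees the other's entry; only the reconciliation step compares them.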

Page 21: 20121105 no tempest in my teapot [dlf forum denver]

Project Gutenberg was 1st launched Dec 1971

Alexa global traffic rank of Project Gutenberg (13-Jun-2012): 5,744

Page 22: 20121105 no tempest in my teapot [dlf forum denver]

• Started Dec 1971

• Worldwide volunteers transcribe or proofread OCR’d public domain books through Distributed Proofreaders

• 40,000 books completed (July 2012)

• Partner / affiliated projects for Australia, Canada, Europe, Germany, Luxembourg, Philippines, Runeberg (Nordic literature), Russia, Taiwan

Page 23: 20121105 no tempest in my teapot [dlf forum denver]

Alexa global / country traffic rank of National Library of Australia (31-Oct-2012): 15,519 / 406

Trove gets ~72% of all National Library web traffic.

Page 24: 20121105 no tempest in my teapot [dlf forum denver]

National Library of Australia

• Online since 2008

• 7,200,000+ pages

• Top text corrector 1,250,000 lines (June 2012)

• 2,450,000+ lines corrected each month (average for 1st 6 months 2012)

• 68,908,757 lines corrected as of July 2012, up from 42,411,468 lines corrected July 2011

• 63,613 total registered users (July 2012)

• 4,146 active users (June 2012)

Page 25: 20121105 no tempest in my teapot [dlf forum denver]

Alexa global / country traffic rank of National Library of Finland: 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)

Page 26: 20121105 no tempest in my teapot [dlf forum denver]

National Library of Finland

• Digitalkoot is a project to improve OCR text in digitized newspapers -- by playing games!

• Digitalkoot is a collaboration between the National Library and Microtask

• Players correct OCR text by playing Myyräsillassa (Mole Bridge) or Myyräjahdissa (Mole Hunt)

• National Library has 4,000,000+ digitized pages

• 109,321 registered players (October 2012)

• Since February 2011 8,024,530 micro-tasks have been completed

Page 27: 20121105 no tempest in my teapot [dlf forum denver]

Alexa global / country traffic rank of UC Riverside (31-Oct-2012): 12,439 / 4,717

CDNC gets ~1.84% of all UC Riverside web traffic.

Page 28: 20121105 no tempest in my teapot [dlf forum denver]

California Digital Newspaper Collection

• CDNC began digitizing newspapers in 2005 as part of NDNP

• Newspapers digitized to article-level as well as to page-level as required by NDNP

• Hosted on Veridian beginning 2009

• Collection size 55,970 issues, 495,175 pages, 5,658,224 articles, 498,000,000+ lines

Page 29: 20121105 no tempest in my teapot [dlf forum denver]

OCR text correction

• OCR text correction added August 2011

• Corrections are done line by line

• ~578,000+ lines of text corrected (Oct 2012)

• ~0.12% of the collection corrected (578,000 of 498,000,000+ lines), more than 99.8% to go!

• Top corrector: ~243,000 lines, more than twice as many as the 2nd corrector

Page 30: 20121105 no tempest in my teapot [dlf forum denver]

CDNC top ten text correctors

User  Lines corrected
1     242,965
2     87,515
3     31,318
4     24,144
5     23,184
6     19,240
7     18,898
8     16,875
9     11,784
10    9,762

NLA top ten text correctors

User  Lines corrected
1     1,456,906
2     1,385,369
3     1,010,360
4     960,230
5     847,340
6     786,147
7     657,187
8     600,513
9     582,276
10    565,384

Page 31: 20121105 no tempest in my teapot [dlf forum denver]

uncorrected OCR accuracy by newspaper title

Title                                                              OCR character accuracy  ~OCR word accuracy*
PRP  Pacific Rural Press 1871 - 1922                               92.6%                   68.1%
SFC  San Francisco Call 1890 - 1913                                92.6%                   68.1%
LAH  Los Angeles Herald 1873 - 1910                                88.7%                   54.9%
LH   Livermore Herald 1877 - 1899                                  88.6%                   54.6%
DAC  Daily Alta California 1841 - 1891                             88.2%                   53.4%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880  86.5%                   48.4%
SN   Sausalito News 1885 - 1922                                    70.4%                   17.3%

*Word accuracy assumes average word length is 5 characters
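The approximate word accuracies in the table follow from the footnote: if a word is 5 characters long and each character is recognized correctly with the stated character accuracy, the chance that the whole word is correct is the character accuracy raised to the 5th power. The sketch below reproduces that arithmetic. Treating character errors as independent is an assumption implied by the calculation, and the function and dictionary names are made up for this example.

```python
def word_accuracy(char_accuracy, word_length=5):
    """Estimate word accuracy from character accuracy.

    Assumes every character in a word is recognized independently with
    probability char_accuracy, so the whole word is right with
    probability char_accuracy ** word_length.
    """
    return char_accuracy ** word_length


# Uncorrected OCR character accuracy by title code, from the table above.
CHAR_ACCURACY = {
    "PRP": 0.926, "SFC": 0.926, "LAH": 0.887, "LH": 0.886,
    "DAC": 0.882, "CFJ": 0.865, "SN": 0.704,
}

for code, accuracy in CHAR_ACCURACY.items():
    print(f"{code}: {accuracy:.1%} characters, ~{word_accuracy(accuracy):.1%} words")
```

Because the exponent compounds small losses, a drop of a few points in character accuracy removes a much larger share of correctly recognized words.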

Page 32: 20121105 no tempest in my teapot [dlf forum denver]

OCR accuracy by newspaper title

Title                                                              OCR character accuracy  Corrected accuracy
PRP  Pacific Rural Press 1871 - 1922                               92.6%                   99.3%
SFC  San Francisco Call 1890 - 1913                                92.6%                   99.6%
LAH  Los Angeles Herald 1873 - 1910                                88.7%                   99.1%
LH   Livermore Herald 1877 - 1899                                  88.6%                   99.9%
DAC  Daily Alta California 1841 - 1891                             88.2%                   99.9%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880  86.5%                   98.3%
SN   Sausalito News 1885 - 1922                                    70.4%                   100.0%

Page 33: 20121105 no tempest in my teapot [dlf forum denver]

corrected accuracy by newspaper title

Title            OCR character accuracy  ~OCR word accuracy*  Corrected accuracy  ~Corrected word accuracy*
PRP 1871 - 1922  92.6%                   68.1%                99.3%               96.5%
SFC 1890 - 1913  92.6%                   68.1%                99.6%               98.0%
LAH 1873 - 1910  88.7%                   54.9%                99.1%               95.6%
LH 1877 - 1899   88.6%                   54.6%                99.9%               99.5%
DAC 1841 - 1891  88.2%                   53.4%                99.9%               99.5%
CFJ 1855 - 1880  86.5%                   48.4%                98.3%               91.8%
SN 1885 - 1922   70.4%                   17.3%                100.0%              100.0%

*Word accuracy assumes average word length is 5 characters

Page 34: 20121105 no tempest in my teapot [dlf forum denver]

correction accuracy by user

User  Average OCR accuracy  Correction accuracy
A     70.4%                 100.0%
B     87.1%                 99.5%
C     95.4%                 99.5%
D     86.5%                 98.3%
E     95.3%                 100.0%
F     91.0%                 100.0%
G     91.0%                 99.8%
H     90.5%                 99.0%
I     96.6%                 99.8%
J     94.8%                 100.0%
K     86.8%                 99.3%

Page 35: 20121105 no tempest in my teapot [dlf forum denver]

the long tail* of crowdsourced OCR text correction

a probability distribution has a long tail if a larger share of the population rests within its tail than it would under a normal distribution

the most productive users are a small fraction of the total user population yet account for ~50% of total production; said a different way, the many users who are individually less productive are, taken together, as important as the most productive users

* The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article “The Long Tail” and by Clay Shirky’s February 2003 essay “Power laws, weblogs, and inequality”.
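One way to check the "small fraction of users, ~50% of production" claim against data is to sort correctors by output and count how many are needed to reach half of all corrected lines. The sketch below does this for CDNC, using the top-ten corrector counts and the ~578,000 total reported elsewhere in these slides. Pairing those two figures is an approximation made for this example, since only the top ten correctors are listed, and the function name is hypothetical.

```python
def users_needed_for_share(lines_by_user, total_lines, share=0.5):
    """Smallest number of top correctors whose lines reach the given share.

    Returns None if the listed correctors together fall short of
    share * total_lines.
    """
    running = 0
    ranked = sorted(lines_by_user, reverse=True)
    for rank, lines in enumerate(ranked, start=1):
        running += lines
        if running >= share * total_lines:
            return rank
    return None


# CDNC top ten correctors and approximate total lines corrected (Oct 2012).
CDNC_TOP_TEN = [242_965, 87_515, 31_318, 24_144, 23_184,
                19_240, 18_898, 16_875, 11_784, 9_762]
CDNC_TOTAL_LINES = 578_000

print(users_needed_for_share(CDNC_TOP_TEN, CDNC_TOTAL_LINES))
```

With these figures the two most productive correctors already account for more than half of the corrected lines, and everyone else supplies the remainder.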

Page 36: 20121105 no tempest in my teapot [dlf forum denver]

OCR text correction long tails

[Figure: two charts of lines corrected per text corrector. CDNC lines corrected by text corrector (axis 0 to 300,000; top corrector 242,965). NLA lines corrected by text corrector (axis 0 to 3,000,000; top corrector 1,456,906). Each chart is marked with a 50% / 50% split.]

Page 37: 20121105 no tempest in my teapot [dlf forum denver]

Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”

Motivation

Page 38: 20121105 no tempest in my teapot [dlf forum denver]

Wisdom of crowds

James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, Anchor Books, New York, 2005.

Diversity: Each person should have private information even if it's just an eccentric interpretation of the known facts.

Independence: People's opinions aren't determined by the opinions of those around them.

Decentralization: People are able to specialize and draw on local knowledge.

Aggregation: Some mechanism exists for turning private judgments into a collective decision.

Page 39: 20121105 no tempest in my teapot [dlf forum denver]

Cognitive surplus

... people are learning to use their free time for creative activities rather than consumptive ones [such as watching TV] ...

... the total human cognitive effort in creating all of Wikipedia in every language is about one hundred million hours ...

... Americans alone watch two hundred billion hours of TV every year, or enough time, if it would be devoted to projects similar to Wikipedia, to create about 2000 of them ...

Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.

Page 40: 20121105 no tempest in my teapot [dlf forum denver]

Motivation: Genealogists and family historians

• National Library of Australia’s 2012 Trove status report showed that ~50% of Trove users are family historians

• National Library of New Zealand survey found that ~50% of PapersPast users are genealogists

• California Digital Newspaper Collection spring 2012 survey discovered that ~70% of its users are genealogists; 75% are 50 years old or older

• A Utah Digital Newspapers survey showed that 72% of its users are genealogists*

* John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World Library and Information Congress. Helsinki. August 2012.

Page 41: 20121105 no tempest in my teapot [dlf forum denver]

• “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”

• “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”

From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.

Motivation: Trove users’ report

Page 42: 20121105 no tempest in my teapot [dlf forum denver]

Motivation: CDNC users’ report

“I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.”

Wesley, California

Personal communications with CDNC text correctors.

Page 43: 20121105 no tempest in my teapot [dlf forum denver]

Motivation: CDNC users’ report

“I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.”

Ann, California

Personal communications with CDNC text correctors.

Page 44: 20121105 no tempest in my teapot [dlf forum denver]

Motivation: CDNC users’ report

“I am correcting text for the Coronado Tent City Program for 1903. It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers. Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program. This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings. This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.”

Gene, California

Personal communications with CDNC text correctors.

Page 45: 20121105 no tempest in my teapot [dlf forum denver]

Motivation: CDNC users’ report

“I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.”

David, United Kingdom

Personal communications with CDNC text correctors.

Page 46: 20121105 no tempest in my teapot [dlf forum denver]

Motivation: CDNC users’ report

“CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. ... I am not doing this very regularly as this is just my hobby and pleasure.”

Jerzey, Poland

Personal communications with CDNC text correctors.

Page 47: 20121105 no tempest in my teapot [dlf forum denver]

Website traffic

Page 48: 20121105 no tempest in my teapot [dlf forum denver]

Website traffic

After a crowdsourcing transcription project of diaries from the American War Between the States, Nicole Saylor, Head of Digital Library Services at the University of Iowa Libraries, reported

“On June 9, 2011, we went from about 1000 daily hits to our digital library on a really good day to more than 70,000.”

Nicole Saylor interviewed by Trevor Owens. “Crowdsourcing the Civil War: Insights Interview with Nicole Saylor” blog post at http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/. Dec 6, 2011.

Page 49: 20121105 no tempest in my teapot [dlf forum denver]

Website traffic

Website traffic at CDNC before / after implementing crowdsourcing

                 before crowdsourcing       after crowdsourcing
                 11-Jun-2011 / 12-Jul-2011  11-Jun-2012 / 12-Jul-2012  change
visits           17,485                     21,488                     +22.9%
unique visitors  11,381                     13,376                     +17.5%
visit duration   9m 24s                     11m 7s                     +18.3%
bounce rate      51.3%                      44.5%                      -6.8 percentage points
pages per visit  14.9                       11.7                       -21.5%
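The change column can be recomputed from the before and after figures. In the sketch below, visits, unique visitors, visit duration and pages per visit are compared as relative changes, while the bounce rate is compared as a difference between two percentages, which matches the slide's 6.8 figure. The function and variable names are illustrative only.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


def point_change(before_pct, after_pct):
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


# Figures from the CDNC traffic table (durations converted to seconds).
BEFORE = {"visits": 17485, "unique visitors": 11381,
          "visit duration (s)": 9 * 60 + 24, "pages per visit": 14.9}
AFTER = {"visits": 21488, "unique visitors": 13376,
         "visit duration (s)": 11 * 60 + 7, "pages per visit": 11.7}

for metric in BEFORE:
    print(f"{metric}: {percent_change(BEFORE[metric], AFTER[metric]):+.1f}%")

print(f"bounce rate: {point_change(51.3, 44.5):+.1f} points, "
      f"{percent_change(51.3, 44.5):+.1f}% relative")
```

Expressed as a relative change, the bounce rate fell by about 13%, so the 6.8 figure is best read as percentage points.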

Page 50: 20121105 no tempest in my teapot [dlf forum denver]

Website traffic

Page 51: 20121105 no tempest in my teapot [dlf forum denver]

Crowdsourcing benefits

Public domain photo courtesy of US Navy

Page 52: 20121105 no tempest in my teapot [dlf forum denver]

Economics

Financial value of outsourced OCR text correction for newspapers?

The Assumptions

• 25 to 50 characters per line in a newspaper column: Assume 40 characters per line (CDNC sample average)

• Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: Assume $0.50 per 1000 characters

Page 53: 20121105 no tempest in my teapot [dlf forum denver]

Economics

CDNC: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560

NLA: 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175
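The two estimates above apply the same formula to the CDNC and National Library of Australia line counts. A small sketch makes the assumptions explicit and lets the per-character rate be varied across the quoted $0.35 to $1.20 range; the function name and defaults are chosen for this example only.

```python
def outsourced_value(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Estimated cost in USD of paying a vendor to correct `lines` lines."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


CDNC_LINES = 578_000      # lines corrected at CDNC (Oct 2012)
NLA_LINES = 68_908_757    # lines corrected at NLA (July 2012)

for name, lines in (("CDNC", CDNC_LINES), ("NLA", NLA_LINES)):
    low = outsourced_value(lines, usd_per_1000_chars=0.35)
    mid = outsourced_value(lines)
    high = outsourced_value(lines, usd_per_1000_chars=1.20)
    print(f"{name}: ${mid:,.0f} (range ${low:,.0f} to ${high:,.0f})")
```

The midpoint figures match the slide; the range shows how sensitive the estimate is to the assumed vendor rate.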

Page 54: 20121105 no tempest in my teapot [dlf forum denver]

Economics

Financial value of in-house OCR text correction?

The Assumptions

• Correction takes 15 seconds per line

• Cost is hourly wage plus benefits of lowest level employee: $10 for CDNC, $41.88* for Australia

* AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.

Page 55: 20121105 no tempest in my teapot [dlf forum denver]

Economics

CDNC: 578,000 lines x 15 seconds per line x 1/3600 hours per second x $10.00 per hour = $24,083

NLA: 68,908,757 lines x 15 seconds per line x 1/3600 hours per second x $41.88 per hour = $12,024,578
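The in-house estimate converts lines to staff hours at 15 seconds per line and then multiplies by an hourly labor cost. The sketch below separates those two steps so the hours are visible as well as the dollars; the function names are assumptions for this example.

```python
def correction_hours(lines, seconds_per_line=15):
    """Staff hours needed to correct `lines` lines of text."""
    return lines * seconds_per_line / 3600


def in_house_value(lines, usd_per_hour, seconds_per_line=15):
    """Estimated labor cost in USD of correcting `lines` lines in-house."""
    return correction_hours(lines, seconds_per_line) * usd_per_hour


print(f"CDNC: {correction_hours(578_000):,.0f} hours, "
      f"${in_house_value(578_000, 10.00):,.0f}")
print(f"NLA: {correction_hours(68_908_757):,.0f} hours, "
      f"${in_house_value(68_908_757, 41.88):,.0f}")
```

Most of the gap between the two totals comes from the line counts, but the assumed hourly rates also differ by a factor of about four.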

Page 56: 20121105 no tempest in my teapot [dlf forum denver]

Accuracy

“His Accuracy Depends on Ours!” Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]

Page 57: 20121105 no tempest in my teapot [dlf forum denver]

Accuracy

• Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers

• Rose Holley (National Library of Australia) reports raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers

Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009.

Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.

Public domain graphic courtesy of Wikimedia Commons.

Page 58: 20121105 no tempest in my teapot [dlf forum denver]

Accuracy

Mapping texts* assesses digitization quality of digital newspapers by comparing the number of words recognized to the total number of words scanned

* Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.

Page 59: 20121105 no tempest in my teapot [dlf forum denver]

Accuracy

How does low text accuracy affect search recall?

The Facts

• Average uncorrected OCR character accuracy of the CDNC sample data is ~89%

• Average length of an English word is 5 characters

• Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8%; round up to 60%, or 6 out of 10 words correct

Public domain graphic courtesy of Wikimedia Commons.

Page 60: 20121105 no tempest in my teapot [dlf forum denver]

Search recall no text correction

[Figure: ten instances of “ARNDT”, each marked as either found or not found.]

Page 61: 20121105 no tempest in my teapot [dlf forum denver]

Accuracy

The Facts

• Average corrected character accuracy of the CDNC sample data is ~99.4%

• Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%

Public domain graphic courtesy of Wikimedia Commons.

Page 62: 20121105 no tempest in my teapot [dlf forum denver]

Search recall with text correction

[Figure: ten instances of “ARNDT”, each marked as either found or not found.]

Page 63: 20121105 no tempest in my teapot [dlf forum denver]

Accuracy

A search for “Arndt” at Chronicling America gives 10,267 results*

• If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found

• If text accuracy is 97.0%, then 317 instances of “Arndt” were not found

* Search performed 31 Oct 2012

Alexa global / country traffic rank of Library of Congress (31-Oct-2012): 4,056 / 1,317

Chronicling America gets ~7.1% of all Library of Congress web traffic.

Public domain graphic courtesy of Wikimedia Commons.
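The "not found" figures for the “Arndt” search can be reproduced with a simple model: if a search only matches words that were recognized entirely correctly, the results returned are the true number of occurrences multiplied by the word accuracy. Dividing the hits by the accuracy estimates the true total, and the difference is what the search missed. This is a sketch of that reasoning under the stated assumption, with a hypothetical function name; it is not a measurement of Chronicling America's actual recall.

```python
def estimated_missed(found, word_accuracy):
    """Estimate how many occurrences a search missed.

    Assumes found = true_total * word_accuracy, i.e. an occurrence is
    retrieved only when its OCR text is entirely correct.
    """
    estimated_total = found / word_accuracy
    return estimated_total - found


HITS = 10_267  # results for "Arndt" at Chronicling America, 31 Oct 2012

for accuracy in (0.558, 0.970):
    missed = estimated_missed(HITS, accuracy)
    print(f"word accuracy {accuracy:.1%}: about {missed:,.1f} instances missed")
```

At 55.8% word accuracy the estimate rounds to the 8,133 on the slide; at 97.0% it falls between 317 and 318, so the slide's 317 reflects rounding down.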

Page 64: 20121105 no tempest in my teapot [dlf forum denver]

Hard-to-measure-but-shouldn’t-be-overlooked benefits

Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.

Page 65: 20121105 no tempest in my teapot [dlf forum denver]

HTMBSBO benefit

“when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages”

Paraphrased from Trevor Owens’ Crowdstorming blog http://crowdstorming.wordpress.com/

Page 66: 20121105 no tempest in my teapot [dlf forum denver]

HTMBSBO benefit

“in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections”

Paraphrased from Trevor Owens’ Crowdstorming blog http://crowdstorming.wordpress.com/

Page 67: 20121105 no tempest in my teapot [dlf forum denver]

Crowdsourcing considerations

• How to market / advertise crowdsourcing?

• How to motivate crowdsourcers?

• Is authentication / identity of crowdsourcers an issue?

• How to administer crowdsourced data?

Photo of Aleister Crowley [Public domain] from Wikimedia Commons

Page 68: 20121105 no tempest in my teapot [dlf forum denver]

Conclusions

Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven

• Lots of crowdsourcing in cultural heritage organizations and elsewhere

• Benefits are multi-faceted: Economic, data accuracy, patron engagement, increased web traffic

Page 69: 20121105 no tempest in my teapot [dlf forum denver]

Try crowdsourcing!

Correct California newspapers text http://cdnc.ucr.edu

Correct Australian newspapers text http://trove.nla.gov.au

Correct Cambridge MA newspapers text http://bit.ly/cambridgepublic

Correct Russian language periodicals http://bit.ly/russianperiodicals

Others soon to follow: Library of Virginia, University of Tennessee, National Library of Singapore, ...

Page 70: 20121105 no tempest in my teapot [dlf forum denver]

?

Brian Geiger
[email protected]

Frederick Zarndt
[email protected]

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.