20140408 digital newspapers collections [idlc kuala lumpur]
digital newspaper collections: if you build one, who will visit?
Frederick Zarndt, IFLA Newspapers Section
[email protected] @cowboyMontana hashtag #IFLAnewspaper
All about historic newspapers digitization at the 2014 International Digital Library Conference in Kuala Lumpur 8-Apr-2014.
about
digital newspapers
programs
collections
users / crowdsourcing
Image: San Francisco Call, 21 April 1906
why digitize newspapers?
"News is only the first rough draft of history."
Alan Barth, writing for the Washington Post, 1943
Wikipedia contributors, "Alan Barth," Wikipedia, The Free Encyclopedia,
https://en.wikipedia.org/wiki/Alan_Barth (accessed March 2014).
why digitize newspapers?
to preserve
to provide access
why preserve?
newspapers are deteriorating
microfilm is dissolving
no storage space or space is too expensive
the principal reason to digitize newspapers is to provide
non-destructive, universal access to newspapers for as many users
as possible
Photo by DAVID ILIFF. License: CC-BY-SA 3.0
reading rooms by the numbers*
Monthly averages
Country       Population    Reading room visitors   Requests for newspapers
                                                    Microform    Print
Australia     22,876,000    5,130                   345          240
France        65,350,000    3,000                   2,000        1,000
Netherlands   16,847,000    NA                      NA           NA
New Zealand    4,414,000    NA                      NA           NA
Norway         4,985,000    600                     400          NA
Singapore      5,184,000    NA                      300          NA
UK            62,262,000    2,000                   6,900        4,816
USA          313,292,000    NA                      NA           NA
*numbers from 2012
physical versus digital
monthly averages 2012
Population     Requests for newspapers    Digitised historical newspapers
               (paper + microform)        (unique visitors)
 22,876,000    585                        150,000
 37,692,000    NA                         12,800
  5,405,000    NA                         NA
 65,350,000    3,000                      22,000
 16,847,000    NA                         50,000
  4,414,000    NA                         83,333
  4,985,000    400                        1,500
  5,184,000    300                        12,400
 62,262,000    11,716                     NA
313,292,000    NA                         NA
Image from
http://www.visualinsight.net/nc/gallery/pages/e-Preservation.html
newspaper digitization is expensive
newspaper digitization is complicated
digital preservation is expensive
digital preservation is untested
BUT
programs
programs: national, cooperative, individual
national: a single (national) library funds and manages a national
newspaper digitization program.
Papers Past, National Library of New Zealand
Newspaper SG, National Library of Singapore
Historiallinen Sanomalehtikirjasto, National Library of Finland
and others
national: centrally funded and centrally managed program with
several participants. strict standards for participants.
National Digital Newspaper Program (Library of Congress)
Australian Newspaper Digitisation Program
cooperative: organizations collaborate to achieve a common goal
but digitization programs are managed separately. flexible standards.
Europeana newspapers
Digital Public Library of America
individual: an organization digitizes on its own. it may follow open
standards but, more usually, does not. the examples here are all
commercial organizations.
ProQuest Historical Newspapers
Newspapers.com
Newsbank
many others
the design of a digitization program requires careful thought
and must be adapted to local circumstances
determine principal or targeted user demographic and use cases
ask those who have gone before
join the IFLA Newspapers Section! (ask me how)
Image courtesy of Donald Zolan.
collections
digital historic newspaper collections (as of Mar 2014)
Library                             Collection                     ~Size (pages)   Dates
National Library of Australia       Trove                          12,668,000      1803-1995
California Digital Newspaper
  Collection                        CDNC                              545,000      1846-2012
National Library of Finland         Historical Newspaper Library    3,006,000      1771-1919
Bibliotheque nationale de France    Gallica                         2,200,000      1293-2000
Koninklijke Bibliotheek             Historische Kranten             9,000,000      1618-1995
National Library of New Zealand     Papers Past                     3,109,000      1839-1945
National Library of Norway          NBDigital Aviser               12,000,000      1763-2012
Singapore National Library          Newspaper SG                    2,400,000      1831-2009
British Library                     British Newspaper Archive       7,598,000      1710-1954
Library of Congress                 Chronicling America             7,293,000      1836-1922
Newspaper collection user survey
California Digital Newspaper Collection and Cambridge Public Library
published a user survey in Mar 2013
604 / 32 responses, respectively
the two surveys are (mostly) identical except for organization name
User demographic: genealogists and family historians
User demographic: no spring chickens
User demographic: reasons for use
User demographic: types of information
Utah Digital Newspapers (UDN): 2012 user survey
72% visit UDN for genealogical research
20% visit for various other types of historical research
87% find obituaries useful
Over 60% find the other genealogical article types (birth and wedding
announcements) useful
Only 7% do not find genealogical articles useful
Many are writing family histories and consequently also look for
general background information
Older content is much more highly valued than more recent content
(see more detailed explanation that follows)
44% find smaller, rural papers more useful, while only 15% find
larger, metropolitan papers more useful
John Herbert and Randy Olsen. Small town papers: still delivering the
news. WLIC 2012, Helsinki, Finland.
http://conference.ifla.org/past-wlic/2012/119-herbert-en.pdf
"The typical Trove user is a very well educated, highly paid,
English speaking employed woman aged fifty or over, with a
significant or primary interest in family or local history, who
visits the Trove website very frequently. Users of Trove newspapers
are older than the average Trove user; only 13% of newspaper users
are under 40 years of age."
Marie-Louise Ayres. Singing for their supper: Trove, Australian
newspapers, and the crowd. WLIC 2013, Singapore.
http://library.ifla.org/245/1/153-ayres-en.pdf
Engaged users: who are they?
"Many of Trove's user engagement features are very popular. More
than 100,000 users have registered to date, and more than 2 million
tags and nearly 60,000 comments had been added ... [Trove] text
correction, however, stands head and shoulders above any other user
engagement features."
Marie-Louise Ayres. Singing for their supper: Trove, Australian
newspapers, and the crowd. WLIC 2013, Singapore.
http://library.ifla.org/245/1/153-ayres-en.pdf
"Crowdsourcing is the practice of obtaining needed services,
ideas, or content by soliciting contributions from a large group of
people, and especially from an online community, rather than from
traditional employees or suppliers. ... [It] is different from
ordinary outsourcing since it is a task or problem that is
outsourced to an undefined public rather than a specific, named
group."
Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free
Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed
March 17, 2013).
Why correct text? Here's why ...
Deaths. llnrieff, Esq. of
Accuracy
Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw
OCR character accuracies of 68% for early 20th century newspapers
Rose Holley (National Library of Australia) reports that raw OCR
character accuracy varied from 71% to 98% on a sample of Trove
digitized newspapers
Rose Holley. How good can it get? Analysing and improving OCR
accuracy in large scale historic newspaper digitisation programs.
D-Lib Magazine. March/April 2009.
Edwin Klijn. The current state-of-art in newspaper digitization.
D-Lib Magazine. January/February 2008.
uncorrected OCR accuracy by newspaper title
Title                                                              OCR character   ~OCR word
                                                                   accuracy        accuracy*
PRP  Pacific Rural Press 1871 - 1922                               92.6%           68.1%
SFC  San Francisco Call 1890 - 1913                                92.6%           68.1%
LAH  Los Angeles Herald 1873 - 1910                                88.7%           54.9%
LH   Livermore Herald 1877 - 1899                                  88.6%           54.6%
DAC  Daily Alta California 1841 - 1891                             88.2%           53.4%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880  86.5%           48.4%
SN   Sausalito News 1885 - 1922                                    70.4%           17.3%
*Word accuracy assumes average word length is 5 characters
OCR accuracy by newspaper title
Title                                                              OCR character   Corrected
                                                                   accuracy        accuracy
PRP  Pacific Rural Press 1871 - 1922                               92.6%           99.3%
SFC  San Francisco Call 1890 - 1913                                92.6%           99.6%
LAH  Los Angeles Herald 1873 - 1910                                88.7%           99.1%
LH   Livermore Herald 1877 - 1899                                  88.6%           99.9%
DAC  Daily Alta California 1841 - 1891                             88.2%           99.9%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880  86.5%           99.8%
SN   Sausalito News 1885 - 1922                                    70.4%           100.0%
corrected accuracy by newspaper title
Title              OCR character   ~OCR word   Corrected   ~Corrected word
                   accuracy        accuracy*   accuracy    accuracy*
PRP  1871 - 1922   92.6%           68.1%       99.3%       96.5%
SFC  1890 - 1913   92.6%           68.1%       99.6%       98.0%
LAH  1873 - 1910   88.7%           54.9%       99.1%       95.6%
LH   1877 - 1899   88.6%           54.6%       99.9%       99.5%
DAC  1841 - 1891   88.2%           53.4%       99.9%       99.5%
CFJ  1855 - 1880   86.5%           48.4%       98.3%       91.8%
SN   1885 - 1922   70.4%           17.3%       100.0%      100.0%
*Word accuracy assumes average word length is 5 characters
correction accuracy by user
User   Average OCR accuracy   Correction accuracy
A      70.4%                  100.0%
B      87.1%                  99.5%
C      95.4%                  99.5%
D      86.5%                  98.3%
E      95.3%                  100.0%
F      91.0%                  100.0%
G      91.0%                  99.8%
H      90.5%                  99.0%
I      96.6%                  99.8%
J      94.8%                  100.0%
K      86.8%                  99.3%
Accuracy
How does low text accuracy affect search recall?
The Facts
Average uncorrected OCR character accuracy of the CDNC sample data
is ~89%
Average length of an English word is 5 characters
Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8%; round
up to 60%, or 6 out of 10 words correct
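The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's own calculation tool: the function name `word_accuracy` is made up for this example, and it assumes, as the slide implicitly does, that each character is recognized independently with the same probability.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Estimate the chance that every character of a word is correct.

    Assumes character errors are independent, so the word-level
    accuracy is the character accuracy raised to the word length.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


# Uncorrected CDNC sample: ~89% character accuracy, 5-character words.
raw = word_accuracy(0.89)
print(f"{raw:.1%}")  # 55.8%
```

The independence assumption is a simplification; real OCR errors tend to cluster on damaged or poorly printed areas of a page, so this is a rough estimate rather than a measurement.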
Search recall: no text correction
Figure: ten instances of ARNDT, each marked as either "instances of
ARNDT found" or "instances of ARNDT not found"
Accuracy
The Facts
Average corrected character accuracy of the CDNC sample data is ~99.4%
Average word accuracy of CDNC corrected text is
99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%
Search recall: with text correction
Figure: ten instances of ARNDT, each marked as either "instances of
ARNDT found" or "instances of ARNDT not found"
Accuracy
A search for Arndt at Chronicling America gives 10,267 results*
If Chronicling America text accuracy is 55.8% (same as uncorrected
CDNC sample), then 8,133 instances of Arndt were not found
If text accuracy is 97.0%, then 317 instances of Arndt were not found
*Search performed 31 Oct 2012
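One way to arrive at figures like these is to treat the hits returned as the correctly recognized share of all true occurrences and solve for the remainder. The sketch below does that; the function name `estimated_missed` is hypothetical, and the assumption that the slide was computed this way is an inference from the numbers, not something the slide states.

```python
def estimated_missed(found: int, word_accuracy: float) -> float:
    """Estimate how many true occurrences a search failed to return.

    Assumes the `found` hits are the fraction `word_accuracy` of all
    true occurrences, so total = found / word_accuracy.
    """
    if found < 0:
        raise ValueError("found must be non-negative")
    if not 0.0 < word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be in (0, 1]")
    total = found / word_accuracy
    return total - found


# Arndt at Chronicling America: 10,267 results returned.
print(estimated_missed(10_267, 0.558))  # roughly 8,133 with raw OCR text
print(estimated_missed(10_267, 0.970))  # between 317 and 318 with corrected text
```

The estimate ignores false positives (other words misread as the search term) and variation in accuracy between titles, so it is best read as an order-of-magnitude illustration.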
Accuracy
Suppose the word/name is longer than 5 characters?
The Facts
Assume that average uncorrected / corrected OCR character accuracy
is ~89% / ~99%, same as CDNC.
Name         Name length   Raw text accuracy   Corrected text accuracy
Eklund       6             49.7%               94.2%
Kennedy      7             44.2%               93.25%
Espinosa     8             39.4%               92.3%
Bonaparte    9             35%                 91.4%
Chatterjee   10            31.2%               90.4%
Accuracy
Name         Number of        Missing results with   Missing results with
             search results   raw text accuracy      corrected text accuracy
Eklund       2,951            2,987                  182
Kennedy      360,723          455,392                26,111
Espinosa     1,918            2,950                  160
Bonaparte    44,664           82,947                 4,203
Chatterjee   19               42                     2
Chronicling America searches done 19-Mar-2013 (6,025,474 pages from
1836 to 1922).
motivation: Trove users report
I enjoy the correction - it's a great way to learn more about
past history and things of interest whilst doing a service to the
community by correcting text for the benefit of others.
I have recently retired from IT and thought that I could be of some
assistance to the project.
It benefits me and other people. It helps with family research.
Rose Holley. Many Hands Make Light Work. National Library of
Australia, March 2009.
motivation: CDNC users report
I am interested in all kinds of history. I have pursued
genealogy as a hobby for many years. I correct text at CDNC because
I see it as a constructive way to contribute to a worthwhile
project. Because I am interested in history, I enjoy it.
Wesley, California
Personal communications with CDNC text correctors.
motivation: CDNC users report
I only correct the text on articles of local interest -
nothing at state, national or international level, no
advertisements, etc. The objective is to be able to help
researchers to locate local people, places, organizations and
events using the on-line search at CDNC. I correct local news &
gossip, personal items, real estate transactions, superior court
proceedings, county and local board of supervisors meetings,
obituaries, birth notices, marriages, yachting news, etc.
Ann, California
Personal communications with CDNC text correctors.
motivation: CDNC users report
I have always been interested in history, especially the
development of the American West, and nothing brings it alive
better than newspapers of the time. I believe them to be an
invaluable source of knowledge for us and future generations.
David, United Kingdom
Personal communications with CDNC text correctors.
motivation: CDNC users report
CDNC is an excellent source of information matching my personal
interest in such topics as sea history, development of
shipbuilding, clippers and other ships etc. ... Unfortunately, the
quality of text ... is rather poor I'm afraid. This is why I started
to do all corrections necessary for myself ... and to leave the
corrected text for use of others. ... I am not doing this very
regularly as this is just my hobby and pleasure.
Jerzey, Poland
Personal communications with CDNC text correctors.
motivation: Cambridge users report
As an amateur historical researcher my time for research is
very limited. Making time to travel to archives, libraries, and
historical societies does not happen as often as I would like. The
Cambridge Public Library's online newspaper collection has been an
invaluable resource and it is fun. I am very grateful for all the
help I have received over the years from so many research
organizations. Correcting text has several benefits. It makes it
much more likely that I will find a story if I decide to search for
it in the future. It is a way of saying thank you to the Cambridge
Library for having such a great resource available and maybe I can
make the next person's research a little easier. It is my own little
historical preservation project.
Cambridge Historical Newspapers Text Corrector
Personal communications with Cambridge text correctors.
Hard-to-measure-but-shouldn't-be-overlooked (HTMBSBO) benefits
Public domain photo: "A useful instruction for young sailors from the
Royal Hospital School, Greenwich," from the National Maritime
Museum.
HTMBSBO benefit
when someone transcribes a document, they are actually better
fulfilling the mission of a cultural heritage organization than
someone who simply stops by to flip through the pages
Paraphrased from Trevor Owens' blog,
http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/
(accessed June 2013).
HTMBSBO benefit
in addition to increasing search accuracy or lowering the costs
of document transcription, crowdsourcing is the single greatest
advancement in getting people using and interacting with library
collections
Paraphrased from Trevor Owens' blog,
http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/
(accessed June 2013).
conclusions
Image: conclusion of the Sonata for piano #32, opus 111 by
Ludwig van Beethoven
newspaper digitization may be difficult but there are many, many
examples of successful digitization programs. ask for help! and join
the IFLA Newspapers Section!
digital newspaper collections are the most used digital library
collections
benefits to crowdsourced text correction and tagging are
multi-faceted: data accuracy, patron engagement, increased web
traffic
know your user community!!
Library of Congress National Digital Newspaper Program
http://www.loc.gov/ndnp/
Australian Newspaper Digitisation Program
http://www.nla.gov.au/content/newspaper-digitisation-program
IFLA Newspapers Section Digitisation projects and best practices
http://www.ifla.org/node/6777
ICON: International Coalition on Newspapers
http://icon.crl.edu/digitization.htm
Wikipedia contributors, "List of online newspaper archives,"
Wikipedia, The Free Encyclopedia,
https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives
(accessed March 17, 2013).
Become a member of the IFLA Newspapers Section! See
http://www.ifla.org/membership or ask me.
Frederick Zarndt, Secretary, IFLA Newspapers Section
[email protected]
Photo held by John Oxley Library, State Library of Queensland.
Original from Courier-mail, Brisbane, Queensland, Australia.