British Library/JISC – Digital Newspapers ALY CONTEH DIGITISATION PROGRAMME MANAGER BATH, 24 SEPTEMBER 2009

British Library - Digitising Historic Newspapers

DESCRIPTION

Presentation given at the "Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text" workshop, held at UKOLN, University of Bath, on 24th September 2009.

Page 1: British Library - Digitising Historic Newspapers

British Library/JISC – Digital Newspapers

ALY CONTEH

DIGITISATION PROGRAMME MANAGER

BATH, 24 SEPTEMBER 2009

Page 2: British Library - Digitising Historic Newspapers

825 million pages

Page 3: British Library - Digitising Historic Newspapers

BACKGROUND

Funding was secured in 2004 & 2007 from JISC to provide on-line access to a mass of historic newspaper content for learning, teaching and research.

Deliverables

• Scanning of complete newspaper runs held by BL

• 3 million pages of C18 & C19 newspapers stabilised and filmed

• Article zoning and page extraction

• OCR of page images

• Production of required metadata

Page 4: British Library - Digitising Historic Newspapers

PROJECT AIMS

• Free access for the academic community to a content-rich online service

• Access to out-of-copyright UK printed material

• Access to a mix of national and provincial newspapers, the majority from new microfilm

• Access to the entire content of each newspaper via OCR, including adverts, pictures, tables and all articles

Page 5: British Library - Digitising Historic Newspapers

SELECTION AND CONSULTATION

Creation of User Panel of academics

UK wide coverage, breadth of century, national and regional titles

Online questionnaire created and the exercise conducted February to March 2005

Users asked to rank titles in order of priority

Replies endorsed UK wide coverage; relevant mix of national/regional titles

Page 6: British Library - Digitising Historic Newspapers

48 titles

3 million pages

10 million articles?

Variations in quality

Variations in structure: daily vs weekly, size, layout

Page 7: British Library - Digitising Historic Newspapers

PHYSICAL CHARACTERISTICS OF SOURCE MATERIAL

Bleed through

Stains

Tight binding

Holes/tears

Creases

Paper quality

Inconsistent inking

Dirt

Stamps

Printer errors

Animals

Repairs

Lamination

Page 8: British Library - Digitising Historic Newspapers

High level view of processes

[Process diagram. Labels: Original; Metadata; XML Encoding; Microfilm; Digital Images; Website/Delivery System]

Page 9: British Library - Digitising Historic Newspapers

Delivery: Greyscale v Bitonal

Page 10: British Library - Digitising Historic Newspapers

OCR: Greyscale v Bitonal

Page 11: British Library - Digitising Historic Newspapers

THE OCR CHALLENGE

• Tiny text

• Varying formats

• Uneven printing

• Vertical skew

• Multiple columns

Page 12: British Library - Digitising Historic Newspapers

Optical Character Recognition (OCR)

THE VIKING'S SONG

Now skall to the Vikings, the Vikings so bold,
So fearless in battle, so famous of old,
With swarthy, tanned features, and long locks of gold;
Ahoi ! my bold Vikings, ahoi !

We plunder the noble, we plunder the priest,
We rob the fat abbot to furnish our feast,
There's no fare so fine as the convent-fed beast;
Ahoi ! my bold Vikings, ahoi I

What vessels of Venice can vaunt to be lighter?
What blades of Toledo can boast being brighter?
What man to the Viking can match as a fighter?
Ahoi I my bold Vikings, ahoi I

Our sword is our father, our ship is our mother,
Our shield is our sister, our breastplate our brother,-
Thus, ask us our kindred, we say we've no other;
Ahoi ! my bold Vikings, ahoi !

So now slack the ropes, turn the sails to the wind,
And smartly the reefs of the canvas unbind,
As we sweep o'er the ocean more plunder to find;
Ahoi ! my bold Vikings, ahoi !

Exceptional Good Poor Worthless

(Exrh-ads from the New York Papers.)

JACKSON IONEY.It is with great pleasure that we per-ceive the true Jackson money is now iacirculation.. Half eagles of the.newJackson coinage are passing freely from,hand to hand this morning, and all who^get hold of them seem to feel at oncethe superiority of such real money tothe miserable p.laper substitute withlwhich, the spirit of aristocracy wouldstill continue to cheat the people. Thenew eagles, half eagles, and quartersare really beautiful coins-at least sowe ate assured, in relation to the eaglesand quarters, and so we can attestfroux:our own examination, in relation to thehalves. The Globe says, "It is de-voutly to be hoped that the Mint maybe able to suppl, all the pressing de-mands -on it-and .that overy~indepen-dent citizen may obtain a low pieces tocarry and preserve as a charm againstthe sorceries of the mammoth.

I SINGULAR AND SERIOUS ACCIDENT.- -O11 WeU Iwtje enoon Mr. Charles %Vyber, of the Borougll-roadt, VFleet Prison to visit a friend, and joined a party IIIroom, who entered into the foolish a seincitllt of 1pcnny-p)icce to the top of thle room, andl eatchivugtle mothli upon its descent to the lloor. Mr.) Lsidered a perfect adept at this game bot time Ile"last found its way into thle throat, where it V't0tdtwards of half an hour. A Surgeon tried tv folCebut being ulnable to do so, lie contrived to mDOVC tinto the stomach. Mr. Wyyber was commmpariatLvllieved by the penny-piece being riemoved out ot thewas enabiled, in the evening, to be carried tohackney-coach.

la 112 B ik e my lat arrived the>Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath,' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn-.die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine andgrocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;'4Stalled the AluidonG.: ceror' Lkndon, with sundries;: ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,;;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman,for eathly Newpot;agd llford; -Tw Br.otherAs, lawces,fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per-wIliti; iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for.:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foirouck , + iii ballasto I _______~ ~ ~ ~~~Ai

Page 13: British Library - Digitising Historic Newspapers

Key factors affecting OCR accuracy

1. Mass production environment – impossible to hand-tweak every image, compromise between time and quality

2. Software – always improving and developing

3. Quality of text varies within a run – see images

4. Complexity of layouts and formats varies across the 48 titles

5. Microfilm source – doesn’t affect this project as the microfilm is of a very good standard, but could in future projects

Page 14: British Library - Digitising Historic Newspapers

Why bother with OCR?

Calculating OCR character accuracy is time-consuming and ultimately misleading

Character accuracy vs Word accuracy

Word accuracy vs Significant Words
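The difference between the three measures can be sketched as follows. This is a minimal illustration, not the method used in the project: it aligns ground truth and OCR output with Python's `difflib` rather than a formal edit distance, and the stopword list, the function names and the "ground truth" sentence (loosely modelled on the penny-piece sample shown earlier) are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stopword list; a real project would choose its own.
STOPWORDS = {"a", "an", "and", "in", "into", "is", "it", "its", "of", "the", "to"}


def tokens(text):
    """Lowercase word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def matched(truth, ocr):
    """Flag each ground-truth item that difflib aligns with the OCR output."""
    flags = [False] * len(truth)
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            flags[i] = True
    return flags


def percent(flags):
    """Percentage of flagged items; an empty list counts as fully correct."""
    return 100.0 * sum(flags) / len(flags) if flags else 100.0


def ocr_scores(truth_text, ocr_text):
    """Character, word and significant-word accuracy against ground truth."""
    truth_words = tokens(truth_text)
    word_flags = matched(truth_words, tokens(ocr_text))
    # Significant words: ground-truth words that are not stopwords.
    significant_flags = [
        flag for word, flag in zip(truth_words, word_flags)
        if word not in STOPWORDS
    ]
    return {
        "characters": percent(matched(truth_text, ocr_text)),
        "words": percent(word_flags),
        "significant words": percent(significant_flags),
    }


# Assumed ground truth and an OCR rendering with three small character errors.
truth = "the penny-piece found its way into the throat"
ocr = "thle pcnny-piece found its way into thle throat"
scores = ocr_scores(truth, ocr)
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

In this example a handful of wrong characters leaves the character score high while spoiling three of the eight words, which is why a character figure on its own can mislead. Two of the spoiled words are stopwords, so the significant-word score sits between the other two.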

Why OCR?

Provides smallest level of access into the information

Size of project is such that detailed descriptions in the metadata are impossible

Page 15: British Library - Digitising Historic Newspapers

[Chart: accuracy by newspaper code, vertical axis from 50.0 to 100.0. Series: characters, words, significant words, words with capital letter start.]

Page 16: British Library - Digitising Historic Newspapers

They had the internet in 1816 !

The Morning Chronicle (London, England), Saturday, May 18, 1816; Issue 14678

Page 17: British Library - Digitising Historic Newspapers

and a DVD in 1803!

The Morning Chronicle (London, England), Friday, June 10, 1803; Issue 10625

Page 18: British Library - Digitising Historic Newspapers

Why Good Quality OCR Matters

January 1874

Page 19: British Library - Digitising Historic Newspapers

Three ways to access information

By

1. Metadata — title, place of publication, dates of publication, issue number, number of pages, page quality rating, illustration indicator

2. Browsing — article images, page images, browse by issue or title

3. OCR — actual text of page as rendered by automatic OCR process
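The three access routes can be pictured as three kinds of query over the same page records. The sketch below is only an illustration under stated assumptions: the `Page` class, its field names and the three functions are hypothetical and do not describe the delivery system. The titles, dates and issue numbers are taken from the two Morning Chronicle citations in this presentation, while the page numbers, quality ratings, illustration flags and OCR text are made-up placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    """One page image with its metadata and OCR text (hypothetical schema)."""
    title: str
    place: str
    issue_date: date
    issue_number: int
    page_number: int
    quality: str        # assumed page quality rating
    illustrated: bool   # assumed illustration indicator
    ocr_text: str


def find_by_metadata(pages, title=None, year=None):
    """Route 1: filter on metadata fields only."""
    return [
        p for p in pages
        if (title is None or p.title == title)
        and (year is None or p.issue_date.year == year)
    ]


def browse_issue(pages, title, issue_number):
    """Route 2: page through one issue in page order."""
    hits = [p for p in pages if p.title == title and p.issue_number == issue_number]
    return sorted(hits, key=lambda p: p.page_number)


def search_text(pages, term):
    """Route 3: case-insensitive search of the OCR text."""
    needle = term.lower()
    return [p for p in pages if needle in p.ocr_text.lower()]


# Placeholder records; the OCR text is invented for the example.
pages = [
    Page("The Morning Chronicle", "London, England", date(1816, 5, 18),
         14678, 2, "Good", False, "placeholder text about shipping"),
    Page("The Morning Chronicle", "London, England", date(1816, 5, 18),
         14678, 1, "Good", False, "placeholder text about parliament"),
    Page("The Morning Chronicle", "London, England", date(1803, 6, 10),
         10625, 1, "Poor", True, "placeholder text about Shipping news"),
]

print(len(find_by_metadata(pages, year=1816)))
print([p.page_number for p in browse_issue(pages, "The Morning Chronicle", 14678)])
print(len(search_text(pages, "shipping")))
```

Only the third route depends on the OCR text, so it is the one whose usefulness rises and falls with OCR quality.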

Page 20: British Library - Digitising Historic Newspapers

Storage

TIFF or JPEG2000

Page 21: British Library - Digitising Historic Newspapers

Costs

British Newspapers 1800-1900: budget re-categorised to show set-up costs, 31 August 2006

Setup, 2%

Website, 5%

Salary, 25%

Ongoing Overheads, 5%

Microfilming, 23%

Digitisation, 40%

Page 22: British Library - Digitising Historic Newspapers

Summary

Access is determined by

– The available technology e.g. OCR, document structure analysis

– The size of the project – mass production environment is a limitation; no hand tweaking

– The source material – there are limitations with poor source material

This project has been a trailblazer, complex and challenging.

We have learnt a great deal about how to give users better, quicker and fuller access to the content.

Page 23: British Library - Digitising Historic Newspapers

www.bl.uk

[email protected]

Thank you