Digitising Hansard Edward Wood Director of Information Management House of Commons 16.6.08


Page 1: Digitising Hansard

Digitising Hansard

Edward Wood
Director of Information Management

House of Commons
16.6.08

Page 2: Digitising Hansard

digitising Hansard

• digitising Hansard: scanning and OCR
• the policy context
• database and front end

Page 3: Digitising Hansard

Hansard

• the official report of debates in Parliament
• actually an unofficial private enterprise at first
• “nationalised” in 1909
• early reports written in the third person
• eventually developed into a (nearly) verbatim account
• volumes from 1803 – 2005 were digitised
• nearly 3 million pages

Page 4: Digitising Hansard

“though not strictly verbatim, [it] is substantially the verbatim report, with repetitions and redundancies omitted and with obvious mistakes corrected, but [...] on the other hand leaves out nothing that adds to the meaning of the speech or illustrates the argument.”

Page 5: Digitising Hansard

why digitise?

• enable preservation
• conservation is expensive
• increase access
• increase usability
• improve business processes
• re-use physical storage space
• costs have fallen significantly
• quality improving steadily

Page 6: Digitising Hansard

preservation vs. conservation

conservation
direct intervention to prevent/make good damage to materials

preservation
a broader term than conservation. It includes all managerial and financial considerations including storage and accommodation provision, staffing levels, policies, techniques, and methods involved in preserving library and archive materials and the information contained therein

Page 7: Digitising Hansard

preservation

• originals printed on poor quality paper
• starting to deteriorate
• reduce wear and tear from daily use
• keep in a controlled environment
• conservation is expensive

Page 8: Digitising Hansard

improve access

• internal
  – extensive day to day business use across a very large site
• public
  – national heritage and birthright
  – disposal by libraries
  – international interest

Page 9: Digitising Hansard

increase usability

• search
• print
• share
• novel uses/mash-ups

quality of digitisation techniques improving steadily

Page 10: Digitising Hansard

costs

• costs have fallen significantly
• alternative funding models
• reduce physical storage needs
  – dispose of surplus copies
  – locate in less valuable space
• but beware the hidden costs…

Page 11: Digitising Hansard

ongoing costs

• developing a front-end and database
• hosting
• storing images
• digital preservation
• format migration

Page 12: Digitising Hansard

alternatives

• microfilm
• conservation
• facsimile

Page 13: Digitising Hansard

why not leave it to the big boys?

in a word, control

• subject matter
• quality
• value added
• use

Page 14: Digitising Hansard

funding models

• self-funding
• commercial funding
• joint funding
• grants

Page 15: Digitising Hansard

doing the work

• In house or contractor?
• method
  – image only
  – re-keying (single, double, triple...)
  – OCR (optical character recognition)
  – image plus text
  – metadata capture
  – manual intervention increases quality and costs!

Page 16: Digitising Hansard

scanning from...

• microfilm
• loose originals
• bound originals
• dis-bound originals

Page 17: Digitising Hansard

OCR

• how accurate does it need to be?
• mass vs batch capture
• double or triple compare
• diminishing returns
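The "double or triple compare" bullet can be illustrated with a minimal sketch. It assumes the phrase means capturing the same text in two or three independent passes and reconciling them by majority vote; the function name and the sample strings are hypothetical, and this is not a description of the tooling the project actually used.

```python
from collections import Counter


def vote_tokens(passes):
    """Reconcile several transcriptions of one line by per-token majority vote.

    Assumes every pass has the same number of whitespace-separated tokens;
    real OCR output would need to be aligned first.
    Returns (tokens, flagged), where flagged lists the token positions
    that did not reach a strict majority and need manual review.
    """
    token_lists = [p.split() for p in passes]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("all passes must have the same number of tokens")

    result, flagged = [], []
    for position, candidates in enumerate(zip(*token_lists)):
        token, count = Counter(candidates).most_common(1)[0]
        result.append(token)
        if count * 2 <= len(candidates):  # no strict majority
            flagged.append(position)
    return result, flagged


# Hypothetical example: the second pass misreads "Member" as "Mernber".
tokens, flagged = vote_tokens(
    ["the hon. Member", "the hon. Mernber", "the hon. Member"]
)
```

With two passes a disagreement can only be flagged for a human to resolve; with three it can usually be resolved automatically. Each extra pass adds cost while catching fewer new errors, which is one way to read "diminishing returns".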

Page 18: Digitising Hansard

QA (quality assurance)

• automate where possible
• contractor
  – 100% proof reading
• client
  – heavy sampling of images
  – 1% sampling of text
• third party?
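The client-side "1% sampling of text" could be drawn along the following lines. This is only a sketch: it assumes sampling is done per page and that a reproducible random sample is wanted; the function name, the seed and the page identifiers are all hypothetical.

```python
import random


def sample_for_qa(page_ids, rate=0.01, seed=0):
    """Pick a reproducible random sample of pages for manual checking."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    page_ids = list(page_ids)
    if not page_ids:
        return []
    # Always check at least one page, however small the batch.
    size = max(1, round(len(page_ids) * rate))
    return random.Random(seed).sample(page_ids, size)


# Hypothetical batch of 1,000 page identifiers.
batch = [f"page-{n:04d}" for n in range(1000)]
to_check = sample_for_qa(batch)
```

Fixing the seed means the same pages can be re-drawn later if the contractor disputes a result. A heavier rate for images, as the slide suggests, is just a larger `rate` argument.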

Page 19: Digitising Hansard

the need for a policy framework

• Hansard was the first major digitisation project in the UK parliament

• an earlier project to digitise Local and Private Acts captured images only

• we needed a digitisation policy for parliament to ensure consistency and learning from experience

Page 20: Digitising Hansard

policy aims

• ensure that individual projects:
  – take into account the wider information context both inside and outside Parliament
  – deliver their target benefits
  – offer value for money
• ensure the resources created can be:
  – exploited fully
  – used for as long as is required

Page 21: Digitising Hansard

policy scope

• publications
• photographs
• archival documents
• business records

Page 22: Digitising Hansard

policy principles

• digitisation needs to be seen as an integral part of the information work carried out by parliament
• use of appropriate technical standards
• scan once for many purposes
• business cases should take account of all relevant costs

Page 23: Digitising Hansard

selection criteria

• measurable user demand (for public use)
• business need (for internal use)
• the potential for learning and educational use
• cost and the availability of other resources
• technical considerations
• the uniqueness of the items
• conservation requirements
• intellectual property rights and copyright issues
• the availability of digitised versions of the same material elsewhere
• the potential for revenue raising
• the feasibility of long-term preservation, where required

Page 24: Digitising Hansard

other aspects of the policy

• the delivery method will be planned at the outset
• the preservation master will be an uncompressed TIFF file
• metadata will be created, to support resource discovery, use, storage and digital preservation
• we will adopt international standards where possible
• we will work with partners where possible

Page 25: Digitising Hansard

developing a digitisation strategy

• a project board has been created
• an integral part of an online parliamentary history programme for parliament
• will use the criteria set out in the digitisation policy to prioritise future digitisation work

Page 26: Digitising Hansard

practical guidelines

• guidelines have been developed for all parts of parliament which need to create digitised assets:
  – a checklist for doing the work
  – glossary
  – details of file formats, OCR options
  – describes popular myths on costs

Page 27: Digitising Hansard

hosting

• text and images
• text only
• navigation
• search
• web 2.0
• funding models
• give it away!?

http://www.parliament.uk/publications/archives.cfm

Page 28: Digitising Hansard

developing a web interface

drivers
• keep costs down
• work closely with users
• meaningful search across a large amount of data

solution
• experimental approach
• open source

Page 29: Digitising Hansard

methodology and progress

• small team of developers from Parliamentary ICT working closely with users (inside and outside Parliament)
• uses “micro formats” approach
• XML is parsed into HTML before loading into the database
• JPEGs not currently being used
• half of the data has been loaded (mainly 20th century)
• public discussion group and issues log
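The step "XML is parsed into HTML before loading into the database" might look roughly like the sketch below. The slides do not show the XML schema or the HTML that was produced, so the `<contribution>` element, its `speaker` and `column` attributes, and the class names are all invented for illustration; they are not a published microformat.

```python
import xml.etree.ElementTree as ET
from html import escape


def contribution_to_html(xml_text):
    """Turn one (hypothetical) <contribution> element into an HTML fragment.

    Semantics are carried in class names on ordinary HTML elements,
    in the spirit of a microformats-style approach.
    """
    node = ET.fromstring(xml_text)
    speaker = escape(node.get("speaker", ""))
    column = escape(node.get("column", ""))
    body = escape("".join(node.itertext()).strip())
    return (
        f'<div class="contribution" data-column="{column}">'
        f'<cite class="member">{speaker}</cite>'
        f'<p class="speech">{body}</p>'
        f"</div>"
    )


# Hypothetical input; not taken from the real Hansard XML.
fragment = contribution_to_html(
    '<contribution speaker="Mr. Example" column="498">'
    "I beg to move &amp; so on.</contribution>"
)
```

Storing ready-made HTML keeps page rendering simple, at the price of re-running the conversion whenever the markup conventions change.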

Page 30: Digitising Hansard

http://hansard.millbanksystems.com

Page 34: Digitising Hansard

faceted classification

• faceted approach to browsing and searching
• assignment of multiple classifications to an object
• classifications can be ordered in a variety of ways
• facets include
  – date
  – volume number
  – monarch
  – chamber
  – content type (debates or questions)
  – constituencies
  – Members of Parliament
  – offices held
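A faceted approach can be sketched in a few lines. This is an illustration under stated assumptions, not the site's data model: the `Section` class, the facet names and the sample records are hypothetical, with each object carrying several values per facet so that it can be filtered or counted along any of them.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    facets: dict = field(default_factory=dict)  # facet name -> set of values


def filter_by_facets(sections, **wanted):
    """Return the sections that carry every requested facet value."""
    return [
        s for s in sections
        if all(value in s.facets.get(name, set()) for name, value in wanted.items())
    ]


def facet_counts(sections, name):
    """Count sections per value of one facet, e.g. to build a browse menu."""
    counts = {}
    for s in sections:
        for value in s.facets.get(name, set()):
            counts[value] = counts.get(value, 0) + 1
    return counts


# Hypothetical sample records.
SECTIONS = [
    Section("War Situation", {
        "chamber": {"commons"}, "year": {1941},
        "content_type": {"debates"}, "member": {"Member A", "Member B"},
    }),
    Section("Example question", {
        "chamber": {"lords"}, "year": {1941},
        "content_type": {"questions"}, "member": {"Member B"},
    }),
]

commons_1941 = filter_by_facets(SECTIONS, chamber="commons", year=1941)
members = facet_counts(SECTIONS, "member")
```

Because each facet is independent, the same records can be browsed by date, then narrowed by chamber or Member, in any order.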

Page 35: Digitising Hansard

other features

• references using the standard format can be located using the search box
  HC Deb Vol 385 13 May 2002 c498
• predictable URLs
  http://hansard.millbanksystems.com/commons/1941/may/07/war-situation
• pages created for:
  – individual Members of Parliament
  – constituencies
  – acts
  – bills
  – divisions
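The reference format and the predictable URL shown above lend themselves to a short sketch. The pattern below is inferred from the single example on the slide, so the `HL` prefix, the three-letter month segment and the way the title is turned into a slug are assumptions; the function names are hypothetical and the site's real routing may differ.

```python
import re
from datetime import date

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Pattern inferred from "HC Deb Vol 385 13 May 2002 c498"; "HL" is assumed.
CITATION = re.compile(
    r"^(?P<house>HC|HL) Deb Vol (?P<volume>\d+) "
    r"(?P<day>\d{1,2}) (?P<month>[A-Za-z]+) (?P<year>\d{4}) "
    r"c(?P<column>\d+)$"
)


def parse_citation(text):
    """Split a standard-format reference into house, volume, date and column."""
    match = CITATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised reference: {text!r}")
    month_name = match["month"].lower()
    if month_name not in MONTHS:
        raise ValueError(f"unknown month: {match['month']!r}")
    return {
        "house": match["house"],
        "volume": int(match["volume"]),
        "date": date(int(match["year"]), MONTHS.index(month_name) + 1, int(match["day"])),
        "column": int(match["column"]),
    }


def sitting_url(chamber, day, title):
    """Build a URL of the form /chamber/year/mon/dd/slug (layout assumed)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    month = MONTHS[day.month - 1][:3]
    return (
        "http://hansard.millbanksystems.com/"
        f"{chamber}/{day.year}/{month}/{day.day:02d}/{slug}"
    )


ref = parse_citation("HC Deb Vol 385 13 May 2002 c498")
url = sitting_url("commons", date(1941, 5, 7), "War Situation")
```

A URL that can be derived from chamber, date and title lets other sites link to a debate without first looking it up.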

Page 46: Digitising Hansard

http://hansard.millbanksystems.com