Context and collections, and the British Library
Ben O’Steen, British Library Labs
@benosteen
The British Library
Inside the British Library
Space for 1200 readers, around 400,000 visitors per year
Uses low oxygen and robots
Reading room and delivery to London
Document Supply and Storage at Boston Spa
Stockton-on-Tees
Authors' right to payment each time their books are borrowed from public libraries.
St Pancras, London, UK
Many books are stored 4 storeys below the building
Legal Deposit Library – Reference only
Living Knowledge Vision (2015 – 2023)
Custodianship Research Business
Culture Learning International
Document: http://goo.gl/h41wW7
Speech: https://goo.gl/Py9uHK
Roly Keating (Chief Executive Officer of the British Library)
To make our intellectual heritage accessible to everyone, for research, inspiration and enjoyment and be the most open,
creative and innovative institution of its kind by 2023.
Collections – not just books!
> 180* million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 3* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents
King’s Library
*Estimates
Wider…not just Researchers
Researchers https://goo.gl/WutNyi
Artists http://goo.gl/nNKhQ2
Librarians
Curators
https://goo.gl/9NWZUW
Software Developers https://goo.gl/7QQ5Tf
Archivists https://goo.gl/x7b4tg
Educators
https://goo.gl/qh01Mi
Digital research methods
Visualisations
Application Programming Interfaces for datasets, e.g. metadata, images
Annotation
Location-based searching & geo-tagging
Crowdsourcing
Human Computation
How did we do this?
Competitions
Awards
Projects
Tell us your ideas of what to do with our digital content
Show us what you have already done with our digital content in research, artistic, commercial and learning and
teaching categories
Talk to us about working on collaborative projects
Getting to the heart of it
British Library Labs works with researchers on their specific problems, trying to assess how widely this problem is felt.
With their help, we talk to communities of researchers and try to pinpoint what they need as opposed to what they think they need to ask us.
Researchers often ask for all the content we have.
What does that mean for digitised items in practice?
Taking a peek at our Open Data
A digitised book…
002819694
OCR XML generated by ABBYY FineReader
Could Labs provide other ways to understand this book?
Optical Character Recognition (OCR) generated text
Scanned Page
Image on Flickr Commons
https://goo.gl/AC43vs
Tagging, Tagging, Tagging…
Iterative crowdsourcing?
(The term is borrowed from Mia Ridge.)
1. Crowdsource broad facts and subcollections of related items emerge.
2. No 'one-size-fits-all': Subcollections allow for more focussed curation.
GOTO 1
SherlockNet: Competition Winner 2016
Karen Wang, Luda Zhao and Brian Do
Using Convolutional Neural Networks to Automatically Tag and Caption the British Library Flickr Commons 1 million Image Collection
12 categories
>20 million tags added >100,000 captions
bit.ly/sherlocknet
Pooled surrounding OCR text on page from similar images
Used Microsoft COCO (photographs) & British Museum Prints and Drawings collections as training sets.
Tags Captions
Artistic / Creative Works
http://goo.gl/dM8ieA
Mario Klingemann (2015)
David Normal 2014 and 2015
Kris Hoffman (2016)
https://goo.gl/QilqqT
Jiayi Chong 2016
Ling Low 2016
https://www.youtube.com/watch?v=bcOP1E5bRE0
https://www.facebook.com/RealmlandStory/
Paul Rand Pierce 2016
A Hat on the Ground Spells trouble
Tragic Looking Women
44 Men who Look 44
(Notice the direction of the faces)
Imaginary Cities – BL Labs Project 16-17
Michael Takeo Magruder
https://goo.gl/4ARwTy
An artistic exploration seeking to create provocative fictional cityscapes for the Information Age from the British Library’s digital collection of historic urban maps
Mario Klingemann 2016
https://www.youtube.com/watch?v=xgnxnmqnR7Y
Google Arts and Culture Lab – Experiments with Machine Learning
https://artsexperiments.withgoogle.com/
Mario Klingemann
http://www.robertelliottsmith.com/?p=530
MIT Moral Machine survey:http://moralmachine.mit.edu/
Presentation shapes perception
Creative Uses
• David Normal installation at Burning Man Festival
• “Moments” by Joe Bell
• Colouring-in Pages for Children
David Normal
http://www.davidnormal.com/
Burning Man Festival
David Normal created light boxes around the Burning Man, using the British Library’s Flickr images
“Crossroads of Curiosity” (20th June -> November, 2015)
But how can anyone find anything useful?
John Cooper, https://www.flickr.com/photos/atomicshed/2436324958 CC-BY-NC-ND 2.0
Infancy of understanding
Large-scale analysis of text is evolving but young.
Exasperating situation where ‘black boxes’ of algorithms are used to draw conclusions.
http://www.scottbot.net/HIAL/?p=41271
“Black Boxes”: a misnomer
It is legitimate and useful to use code that you could not write.
It is not legitimate to simply believe the ‘label’ on the side of the box.
E.g. “Sentiment Analysis” is often nothing of the sort.
Quoting Scott Weingart: (emphasis mine)
● Do sentiment analysis algorithms agree with one another enough to be considered valid?
● Do sentiment analysis results agree with humans performing the same task enough to be considered valid?
● Is Jockers’ instantiation of aggregate sentiment analysis validly measuring anything besides random fluctuations?
● Is aggregate sentiment analysis, by human or machine, a valid method for revealing plot arcs?
● If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to map onto plot arcs, can they still be valid measurements of anything at all?
● Can a subjective concept, whether measured by people or machines, actually be considered invalid or valid?
(again from http://www.scottbot.net/HIAL/?p=41271)
* (2012) https://ariddell.org/where-are-the-novels.html
Digitisation
Often through Partnerships with Commercial & Other Organisations
Bias in digitisation
http://goo.gl/bR9UJL
Sample Generator
Open Licensed Digital Content?
15% Openly Licensed
Around 10%* available online
Working through
Breakdown by collection*
Manuscripts 59%
Books 9%
Maps and Views 7%
Newspapers 3%
Archives and Records 3%
Paintings, Prints and Drawings 2%
*Based on digitisation projects
Largest proportion of funding
Public / Private Partnership
15%* Openly Licensed
85%* Available onsite
*Estimates
Accessing digital collections onsite
OPEN £
•Have to be ‘onsite’
•Need to be security cleared for some collections
– Hence ‘Researcher in Residence Model’
•Permission required (depending on ‘story’ of collection)
•Content on various media formats
•20 % re-use of material for non commercial research for some collections
•We are learning ‘pathways’ so that this becomes ‘everyday’ to provide onsite access in the future
Typical pattern of research for Labs
•Finding invisible things in ‘messy’ historical data
•Unearthing / unlocking hidden histories and data to stimulate new research
•Celebrating hidden histories / data creatively through events, art and performance
Finding things in messy OCR text
Mrs Folly
• Clean up some manually
• Get human ‘ground truth’
• Write code to find things reliably in it automatically
• Try code on messy content
• Tweak if necessary
• Digital ‘lasso’ around content
• Human sift through
Mrs Folly
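The loop above (ground truth, code, try, tweak) can be sketched in a few lines of Python. This is a minimal illustration only, not the code Labs used: the choice of "Mrs Folly" as the search target, the regular expression and the sample lines are all assumptions made up for the example.

```python
import re

# Hypothetical pattern: tolerate a few common OCR confusions when looking
# for the name "Mrs Folly" (long s read as f, l/1/I swaps, o/0 swaps).
PATTERN = re.compile(r"M[rn][s5f]\.?\s+F[o0][l1I]{2}y", re.IGNORECASE)

def find_candidates(lines):
    """Return the set of line indexes where the pattern matches."""
    return {i for i, line in enumerate(lines) if PATTERN.search(line)}

def precision_recall(found, truth):
    """Compare machine hits with the human 'ground truth' line indexes."""
    true_pos = len(found & truth)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall

# Invented sample lines, not taken from any real newspaper.
sample = [
    "Yesterday Mrs Folly appeared at the theatre",
    "The price of corn rose again this week",
    "Mrf. Fo1ly was received with applause",
    "Mr Foley of Cork arrived on Tuesday",
]
ground_truth = {0, 2}  # lines a human marked as mentioning her

found = find_candidates(sample)
print(found, precision_recall(found, ground_truth))
```

If precision or recall is poor on the hand-checked sample, the pattern is tweaked and the comparison is run again before the code is let loose on the rest of the messy content.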
Code: Machine Learning / Reading
•Analogies to how humans read / learn
•Machines acquire ‘knowledge’ / data and use that knowledge / data to make sense / identify patterns
•Labs doing this on a case by case basis so methods can vary
•Need computational AND human effort
•Legalities of this process being ‘ironed out’ with publishers
•Often a misunderstood area…
•Computers look for ‘patterns’ or the ‘essence’ of something
Katrina Navickas (2015) Political Meetings Mapper
http://politicalmeetingsmapper.co.uk https://goo.gl/Qq78Oa
Labs Symposium 2015
https://goo.gl/BSA3be
Interview 2015
The Chartist Newspaper
http://goo.gl/vOLSnH
Chartist Monster Meeting
Chartists Walking Tour and Re-enactment London
Working with Newspaper Collections
Using Jupyter Notebooks
Virtual Infrastructure for OCR text
OCR text scraped from digitised newspapers and in the cloud
Jupyter notebook
Write Python code and results in browser
http://jupyter.org
Access available for researchers ‘in residence’
Black Abolitionists in the UK
Researcher: Hannah-Rose Murray
Black Abolitionist Performances & their Presence in Britain (2016) – Hannah-Rose Murray
Aberdeen Journal, 5 February 1851 “Fugitive Slaves”
Aberdeen Journal, 14 April 1847 “Frederick Douglass, The Emancipated Slave”
Frederick Douglass
Ellen Craft
Josiah Henson
Ida B Wells
A Performance by Joe Williams & Martelle Edinborough
http://frederickdouglassinbritain.com/
Use of Overproof / OCR Correction?
Re-OCR with ABBYY FineReader?
https://www.abbyy.com/en-gb/
http://overproof.projectcomputing.com/
Surveyed a set portion of the collection for the words we were interested in, and for words at Levenshtein distance 1 or 2 from them.
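A minimal sketch of that kind of survey, assuming the OCR text has already been split into tokens. The function names, the target word and the sample sentence are invented for illustration; the project's actual code is not shown in these slides.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def near_variants(tokens, targets, max_distance=2):
    """Group tokens by their distance (0..max_distance) from each target."""
    hits = {}
    for token in tokens:
        for target in targets:
            d = levenshtein(token.lower(), target.lower())
            if d <= max_distance:
                hits.setdefault(target, {}).setdefault(d, set()).add(token)
    return hits

# Invented OCR-like sample; "flave" and "slavc" stand in for misread "slave".
tokens = "the flave trade and the slavc population of the state".split()
print(near_variants(tokens, ["slave"]))
```

In this sample "state" also lands at distance 2 from "slave", which shows why the distance-2 hits need a human sift: the net catches real words as well as OCR misreadings.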
Naive-Bayes Classifier:
Classifiers allowed us to prioritise relevant articles without reading them ourselves:
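To show the idea, here is a small multinomial naive Bayes classifier in plain Python with add-one smoothing. It is a sketch under stated assumptions: the class, the `prioritise` helper, the labels and the four training snippets are hypothetical, and a real project would use far more hand-labelled articles (and probably an existing library).

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood for every label."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

def prioritise(model, articles, label="relevant"):
    """Sort articles so those scoring highest for `label` come first."""
    def margin(text):
        scores = model.log_scores(text)
        best_other = max(v for k, v in scores.items() if k != label)
        return scores[label] - best_other
    return sorted(articles, key=margin, reverse=True)

# Invented training snippets standing in for hand-labelled articles.
training = [
    ("lecture by the fugitive slave at the town hall", "relevant"),
    ("anti-slavery meeting addressed by frederick douglass", "relevant"),
    ("price of corn and cattle at the market", "other"),
    ("shipping news arrivals at the harbour", "other"),
]
model = NaiveBayes().fit([t for t, _ in training], [l for _, l in training])
queue = prioritise(model, ["corn prices at the market",
                           "anti-slavery lecture at the hall"])
print(queue[0])
```

The ranked queue is what lets a researcher read the most promising articles first; the labels on the box still need checking against a human reading.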
Data-mining verse in 18th Century newspapers
BL Labs Project 16-17, Jennifer Batt
https://goo.gl/5Akthd
Slides courtesy Jennifer Batt
Jennifer Batt @ the BL on World Poetry Day
What thoj' among ourrelves, with too much Heat, or t W: fweutimes.wongle, wvhen we Ihould debate, W – (A confequential Ill which Freedom drawvs, fl t A bad Efficf, but from a noble Caufe) t We can with univeifal Zcal advance, to To cutb the faithlefs Arrogancccof V rance. hi
Dublin Journal 10-14 September, 1745
Verse: 81% lines begin with initial capital
Prose: 52% lines begin with initial capital
Westminster Journal 3 March 1745
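The initial-capital figures suggest a simple feature for telling verse from prose. The sketch below computes it for a block of lines; the 0.7 threshold is an assumption picked between the 52% and 81% figures, the prose sample is invented, and the verse sample is the opening of Gray's "Elegy" rather than anything from the project's data.

```python
def initial_capital_ratio(lines):
    """Share of non-empty lines whose first character is an uppercase letter."""
    stripped = [line.strip() for line in lines if line.strip()]
    if not stripped:
        return 0.0
    capitals = sum(1 for line in stripped if line[0].isupper())
    return capitals / len(stripped)

def looks_like_verse(lines, threshold=0.7):
    """Crude single-feature guess; the threshold is a hypothetical choice."""
    return initial_capital_ratio(lines) >= threshold

verse = [
    "The curfew tolls the knell of parting day,",
    "The lowing herd wind slowly o'er the lea,",
    "The plowman homeward plods his weary way,",
    "And leaves the world to darkness and to me.",
]
prose = [
    "Yesterday arrived the mail from Holland, which",
    "brings advice that the army has decamped and",
    "is marching towards the frontier. Letters from",
    "Paris say the court is removed to Versailles.",
]
print(initial_capital_ratio(verse), initial_capital_ratio(prose))
```

Because about half of prose lines also start with a capital, this feature alone is weak on short passages and on messy OCR; it is better treated as one signal among several than as a classifier in its own right.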
http://varianceexplained.org/r/kmeans-free-lunch/
In Summary:
- Context about how a digitised image came to be and why it was scanned is both crucial to understand and sometimes crucial to hide.
- aka Opening up large collections brings its own issues.
- Presentation shapes perception.
- Too much trust in black-box algorithms, like search engines or social feed suggestions.
- So little of our history is online that there is a natural bias. The gaps are being filled in with less credible sources.
- It still might have happened even if you cannot google it, and vice versa!