BABY ELEPHÃT – BUILDING AN ANALYTICAL BIBLIOGRAPHY FOR A PROSOPOGRAPHY IN EARLY ENGLISH IMPRINT DATA

NUSHRAT KHAN

Oxford-Illinois Digital Libraries Placement Programme


ABOUT EEBO-TCP

- Collaboration between the universities of Oxford and Michigan from 1999-2015

- Early English texts printed between 1473 and 1700

- 25,000 texts made available online

- Full-text searching available through the EEBO-TCP database


WHY HISTORIC TEXTS ARE INTERESTING

Historic Datasets

Accessibility

Reveal Historical Information

Semantic Web

Technical Interoperability

Future Research


WORKSET CONSTRUCTOR

Enables workset creation from Person, Place, Subject, Genre and Dates parameters (http://eeboo.oerc.ox.ac.uk/)


HOW DOES IT WORK?

Workflow of Publishing Structured Metadata

Metadata extracted from TEI → Data Clean Up → Link the Data

Available Metadata Fields

• Title

• Author Name

• Date (Precise Birth, Precise Death, precise-floruit-from, precise-floruit-to)

• Raw Publication Place

• Raw Publication Date

• Publisher


SAMPLE PUBLISHER DATA

Publisher

By Rycharde Iugge, printer to the Quenes Maiestie,

Printed by I[ohn] C[harlewood] for Iohn Hinde, dwelling in Paules Church-yarde, at the signe of the golden Hinde,

Printed by Benjamin Took and John Crook, and are to be sold by Mary Crook & Andrew Crook ...,

Printed by Peter Smith, and at Saint-Omer at the English College Press],

s.n.],

[By J. Charlewood] for Edward White, dwelling at the little North doore of Paules Church, at the signe of the Gunne,

Imprinted by Richard Field, and are to be sold by Richard Garbrand [, Oxford],

[By I. Jaggard?] for M. S[parke.,

Imprinted by E: G[riffin]: for Iohn Budge, and Ralph Mab,

By [J. King for?] Iohn waley dwellyng in Foster lane,


INSIDE THE DATA

Work

Printed By

Sold At

Printed For

Sold By

Printed At

Punctuation and editorial marks found in the field: ? : . [ ] , … “ ”
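The marks above are what the data cleaning step has to deal with. A minimal sketch of such a clean-up, assuming a hypothetical helper named `clean_imprint` and assuming it is acceptable to drop editorial brackets while keeping the text they enclose (the slides do not say how the cleaning was actually done):

```python
import re


def clean_imprint(raw: str) -> str:
    """Normalise one raw publisher string (hypothetical helper, not the project's code)."""
    # Replace ellipses, which mark text omitted from the imprint
    text = raw.replace("...", " ").replace("…", " ")
    # Remove editorial square brackets but keep what they enclose: I[ohn] -> Iohn
    text = re.sub(r"[\[\]]", "", text)
    # Remove question marks that flag uncertain attributions
    text = text.replace("?", "")
    # Collapse runs of whitespace and any space left before a comma
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+,", ",", text)
    # Trim stray leading and trailing spaces and commas
    return text.strip(" ,")


print(clean_imprint("Printed by I[ohn] C[harlewood] for Iohn Hinde,"))
```

Removing the brackets discards the distinction between printed and editorially supplied text, so a fuller pipeline might record that flag before cleaning.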


WORKFLOW

Data Cleaning

Named Entity Extraction (Person – Printed by, Printed for and Sold by)

Store triples and generate RDF

Happy querying!


ENTITY RECOGNITION APPROACHES

NLTK Entity Extractor

Regular Expression


REVERB

Automatically identifies and extracts binary relationships from English sentences

Input: raw text

Output: (Argument1, Relation Phrase, Argument2)

Example input: Bananas are an excellent source of potassium

Example output: (bananas, be source of, potassium)


OPEN CALAIS

Not as effective on short texts, e.g. Printed by A. Bells

Input text too short

Example sentence: Printed by Melchisedech Bradwood for William Aspley

Cannot detect the names as persons


NLTK ENTITY RECOGNIZER

Step 1: Extracted all the entities labeled as PERSON for each sentence

work_000001|Rycharde Iugge
work_000003|Pauls
work_000004|Iohn Charlewood
work_000004|Iohn Hinde
work_000005|Ioan Danter
work_000006|Francis Grove
work_000007|Henry Goddus
work_000008|Arthur Iohnson
work_000012|Leonard Lichfield
work_000013|Langly Curtis
work_000014|Benjamin Took
work_000014|John Crook
work_000014|Mary Crook
work_000014|Andrew Crook
work_000015|William Keblewhite

All the entities NLTK can extract for each record (with some limitations)


LIMITATIONS OF NLTK

• NLTK does not identify initials as names, e.g. A. B.

• Extracts only the surname in expressions like A. Bells, Edw: Allde

• Tags the word “Printer” as a person when it is capitalised and follows ‘by’, e.g. Printed by John Bill, Printer to the King's most Excellent Majesty

• In complex sentences containing multiple names, it cannot reliably detect and extract all the names
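The first three limitations are the kind of case the regular expression approach can cover. A rough sketch, assuming a hand-written pattern (not the one used in the project) that accepts initials such as "A.", abbreviated forenames such as "Edw:", and capitalised words, and that only looks at the first name following "by" or "for":

```python
import re

# One name token: an initial or abbreviation ending in "." or ":", or a capitalised word
TOKEN = r"(?:[A-Z][a-z]*[.:]|[A-Z][a-z]+)"
# A name is two or more tokens separated by whitespace
NAME = rf"{TOKEN}(?:\s+{TOKEN})+"
# Only accept a name that directly follows "by" or "for"
AFTER_PREP = re.compile(rf"\b(?:[Bb]y|[Ff]or)\s+({NAME})")


def names_after_preposition(text: str) -> list[str]:
    """Return the first name following each 'by' or 'for' in an imprint string."""
    return AFTER_PREP.findall(text)


print(names_after_preposition("Printed by Edw: Allde"))
print(names_after_preposition("Printed by John Bill, Printer to the King's most Excellent Majesty"))
```

Anchoring on the preposition keeps "Printer" and "Excellent Majesty" out of the results, but the pattern still misses second names joined by "and" and surnames printed in lower case, such as "Iohn waley".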


FINDING RELATIONSHIPS WITHIN SENTENCES

("'Printed", 'JJ')
('by', 'IN')
(PERSON Benjamin/NNP Took/NNP)
('and', 'CC')
(PERSON John/NNP Crook/NNP)
('and', 'CC')
('are', 'VBP')
('to', 'TO')
('be', 'VB')
('sold', 'VBN')
('by', 'IN')
(PERSON Mary/NNP Crook/NNP)
('&', 'CC')
(PERSON Andrew/NNP Crook/NNP)

Look for the preceding preposition

Separate the entities based on ‘by’ or ‘for’

“You're having a hard time because it's hard. This is really not an easy task to approach.” – jonrsharpe, Jul 31 '14
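The preposition rule can be sketched in a few lines. This is an illustration only: it assumes the chunker output has been simplified to a flat list in which plain tokens are `(word, tag)` pairs and each person chunk is `("PERSON", name)`, and the role names `printed_by`, `printed_for` and `sold_by` are labels chosen for this sketch.

```python
def assign_roles(chunks):
    """Assign each PERSON to a role based on the verb and preposition before it."""
    roles = {"printed_by": [], "printed_for": [], "sold_by": []}
    verb = "printed"   # assumption: an imprint opens with the printing statement
    current = None     # role that the next PERSON will receive
    for first, second in chunks:
        if first == "PERSON":
            if current is not None:
                roles[current].append(second)
            continue
        word = first.strip("'\"").lower()
        if word in ("printed", "imprinted"):
            verb = "printed"
        elif word == "sold":
            verb = "sold"
        elif word == "by":
            current = f"{verb}_by"
        elif word == "for":
            current = "printed_for"
    return roles


example = [
    ("'Printed", "JJ"), ("by", "IN"), ("PERSON", "Benjamin Took"), ("and", "CC"),
    ("PERSON", "John Crook"), ("and", "CC"), ("are", "VBP"), ("to", "TO"),
    ("be", "VB"), ("sold", "VBN"), ("by", "IN"), ("PERSON", "Mary Crook"),
    ("&", "CC"), ("PERSON", "Andrew Crook"),
]
print(assign_roles(example))
```

A person keeps the most recent role until another "by" or "for" appears, which is how names joined by "and" or "&" end up in the same group.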


DATA REFINING

Printed & Sold By

Sold By

Printed By

Printed For

De-duplicate the ‘Sold by’

Put back the ones in ‘Printed and Sold by’

Extracted separately using regex
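One possible reading of this step, sketched below: names following "Printed and sold by" are pulled out with a dedicated pattern, the role lists are de-duplicated, and the printed-and-sold names are then put back into both roles. The pattern and the function name `refine` are assumptions made for illustration, not the project's code.

```python
import re

# Name pattern: two or more tokens, each an initial/abbreviation or a capitalised word
TOKEN = r"(?:[A-Z][a-z]*[.:]|[A-Z][a-z]+)"
NAME = rf"{TOKEN}(?:\s+{TOKEN})+"
PRINTED_AND_SOLD = re.compile(
    rf"\b[Pp]rinted\s*(?:and|&)\s*[Ss]old\s+by\s+({NAME})"
)


def refine(imprint, printed_by, sold_by):
    """De-duplicate role lists and add 'Printed and sold by' names to both roles."""
    both = PRINTED_AND_SOLD.findall(imprint)
    # dict.fromkeys removes duplicates while keeping first-seen order
    printed = list(dict.fromkeys(printed_by + both))
    sold = list(dict.fromkeys(sold_by + both))
    return printed, sold


print(refine("Printed and sold by Langly Curtis", [], ["Langly Curtis", "Langly Curtis"]))
```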


GENERATING UNIQUE URI

Ideal case: assign one unique URI per person, shared by every mention of that person

Exception in this case:

• Few authoritative sources to refer to

• Time consuming validation

• Very limited information about each person available

Approach taken: assigned a unique URI to every instance

Python uuid module – uuid4() function
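A minimal sketch of minting one URI per instance with `uuid.uuid4()`. The base URI `http://example.org/person/` is a placeholder; the slides do not give the real namespace.

```python
import uuid

BASE = "http://example.org/person/"  # hypothetical namespace


def mint_person_uri() -> str:
    """Return a fresh URI built from a random (version 4) UUID."""
    return BASE + str(uuid.uuid4())


# Two mentions of the same name still receive two different URIs
first = mint_person_uri()
second = mint_person_uri()
print(first)
print(second)
```

Because the identifiers are random, nothing links two mentions of the same person; that reconciliation is left to later work with authoritative sources.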


WORKING WITH ONTOLOGY

Checked existing ontologies for ‘Printed by’ and ‘Printed for’ relationships (MODS, MADS, BIBFRAME, etc.)

EEBOO Ontology

Modify the existing ontology to define the new relationships

Work

Author

Printed By

Printed For

Sold By


STORING TRIPLES AND GENERATING RDF
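As an illustration of what this step produces, the sketch below serialises one work-to-person relationship as N-Triples using plain string formatting. The namespace `http://example.org/eeboo/` and the property name `printedBy` are placeholders, not the actual EEBOO ontology terms, and the slides do not say which RDF tooling was used.

```python
EX = "http://example.org/eeboo/"  # hypothetical namespace
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def role_triples(work_uri, role, person_uri, name):
    """Return two N-Triples lines: work -> role -> person, and the person's label."""
    return [
        f"<{work_uri}> <{EX}{role}> <{person_uri}> .",
        f"<{person_uri}> <{RDFS_LABEL}> {literal(name)} .",
    ]


lines = role_triples(EX + "work_000001", "printedBy", EX + "person/123", "Rycharde Iugge")
print("\n".join(lines))
```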


QUERYING ON THE DATA 1

Top 20 Publishers

Top 20 Printed For

Top 20 Sold By
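A "top 20" ranking amounts to counting how often each name appears in a given role. The sketch below does this in plain Python over `(work, role, name)` records rather than over the RDF store; the record layout and the toy data are assumptions for illustration.

```python
from collections import Counter


def top_names(records, role, n=20):
    """Return the n most frequent names for one role as (name, count) pairs."""
    counts = Counter(name for _, r, name in records if r == role)
    return counts.most_common(n)


# Toy records, invented for this example
records = [
    ("work_a", "printed_by", "John Crook"),
    ("work_b", "printed_by", "John Crook"),
    ("work_c", "printed_by", "Benjamin Took"),
    ("work_c", "sold_by", "Mary Crook"),
]
print(top_names(records, "printed_by", n=2))
```

Since every instance has its own URI, a ranking like this has to group by the name string, so spelling variants of one person are counted separately.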


QUERYING ON THE DATA 2

The sellers for the works published by Henri Hills

Both Printed and Sold by Henri Hills

Sellers who worked with Henri Hills: Will Larner, Jane Underhill, Francis Smith
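The logic of this query can be sketched over `(work, role, name)` records: find the works printed by one person, then collect who sold them. The record layout, the function names and the work identifiers below are invented for illustration; the actual queries were run against the RDF data.

```python
def works_printed_by(records, printer):
    """Set of works that list the given name as printer."""
    return {w for w, role, name in records if role == "printed_by" and name == printer}


def sellers_for_printer(records, printer):
    """Other people who sold works printed by the given printer."""
    works = works_printed_by(records, printer)
    return sorted({name for w, role, name in records
                   if role == "sold_by" and w in works and name != printer})


def printed_and_sold_by(records, person):
    """Works that the same person both printed and sold."""
    sold = {w for w, role, name in records if role == "sold_by" and name == person}
    return sorted(works_printed_by(records, person) & sold)


# Toy records with hypothetical work identifiers
toy = [
    ("work_x", "printed_by", "Henri Hills"), ("work_x", "sold_by", "Will Larner"),
    ("work_y", "printed_by", "Henri Hills"), ("work_y", "sold_by", "Henri Hills"),
    ("work_z", "printed_by", "John Bill"), ("work_z", "sold_by", "Jane Underhill"),
]
print(sellers_for_printer(toy, "Henri Hills"))
print(printed_and_sold_by(toy, "Henri Hills"))
```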


FUTURE DIRECTION

• Train NLTK to capture the names properly

• Extract specific place names from the publisher field, e.g. sold at Golden Hinde

• Where only initials are given, work out the full name, e.g. whether R. Charles is Robert Charles or Ruth Charles, possibly with help from a domain expert

• Analyze how name expressions have changed over time

• Identify the authors using authoritative sources and domain-specific knowledge, e.g. London Book Trades Index, British Book Trade Index

• Analyze and visualize the data by mapping


GRATITUDE

• Terhi Nurmikko-Fuller

• David M. Weigl

• Professor David De Roure

• Kevin Page

• Pip Willcox

And everybody else at OeRC!