The task of cleaning and enriching large collections: what aspects can we share?

Page 1: The task of cleaning  and enriching  large collections

The task of cleaning and enriching large collections

what aspects can we share?

Page 2: The task of cleaning  and enriching  large collections

Contributing to this work

UIUC English: Ted Underwood, Jordan Sellers, Mike Black

UIUC Library: Harriett Green

I3: Loretta Auvil, Boris Capitanu

Andrew W. Mellon Foundation

Page 3: The task of cleaning  and enriching  large collections

“Enrich” as well as “clean.”

Page 4: The task of cleaning  and enriching  large collections

Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.
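A yearly wordlist ratio of this kind can be sketched in a few lines. This is only an illustration: the function name, the two tiny wordlists, and the sample volumes below are hypothetical, not the lists or data behind the slide.

```python
from collections import defaultdict

def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: occurrences of list_a words / occurrences of list_b words}.

    `volumes` is an iterable of (year, tokens) pairs. Years in which no
    list_b word occurs are left out, which avoids dividing by zero.
    """
    set_a, set_b = set(list_a), set(list_b)
    counts_a = defaultdict(int)
    counts_b = defaultdict(int)
    for year, tokens in volumes:
        for tok in tokens:
            tok = tok.lower()
            if tok in set_a:
                counts_a[year] += 1
            if tok in set_b:
                counts_b[year] += 1
    return {y: counts_a[y] / counts_b[y] for y in sorted(counts_b)}

# Hypothetical sample data: (publication year, tokens of one volume).
volumes = [
    (1800, ["the", "heart", "felt", "the", "system"]),
    (1800, ["heart"]),
    (1850, ["system", "system", "heart"]),
]
ratio = yearly_wordlist_ratio(volumes, ["heart", "felt"], ["system"])
print(ratio)  # {1800: 3.0, 1850: 0.5}
```

Running the same function once per genre would give one series per genre.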

Page 5: The task of cleaning  and enriching  large collections
Page 6: The task of cleaning  and enriching  large collections

“representative?”

Page 7: The task of cleaning  and enriching  large collections

analyzing the data

cleaning the data

Page 8: The task of cleaning  and enriching  large collections

“clean” is relative

Page 9: The task of cleaning  and enriching  large collections

different projects will strike a different balance between precision and recall

makes it tricky to share resources
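The trade-off can be made concrete with a small sketch. The word sets are invented for illustration: a strict OCR-error detector flags only what it is sure of (high precision, low recall), while a loose one catches every error but also flags good words.

```python
def precision_recall(flagged, actual):
    """Precision and recall of a set of flagged items against the true set."""
    flagged, actual = set(flagged), set(actual)
    true_pos = len(flagged & actual)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical data: tokens that really are OCR errors in some text.
actual_errors = {"furely", "fin", "moft", "fuch"}
strict = {"furely", "fin"}                                # flags little
loose = {"furely", "fin", "moft", "fuch", "fine", "fat"}  # flags a lot

print(precision_recall(strict, actual_errors))  # (1.0, 0.5)
print(precision_recall(loose, actual_errors))
```

A project that needs trustworthy word counts may prefer the strict setting; one that needs full-text search may prefer the loose one, so the same shared resource will not suit both equally.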

Page 10: The task of cleaning  and enriching  large collections

Cleaning the data

1. Clean up the OCR / assess error.
2. Identify parts of a volume (e.g., articles in a serial, poetry/prose).
3. Remove library bookplates and running headers, after using them for (2).
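One simple way to find running headers is to look for first lines that recur across many pages. The sketch below assumes each page is already a list of text lines; the function names, the 50% threshold, and the sample pages are all hypothetical choices, not the method used in this project.

```python
import re
from collections import Counter

def normalize(line):
    """Lowercase and drop digits/punctuation so 'REVIEW. 12' matches 'REVIEW. 13'."""
    return re.sub(r"[^a-z ]", "", line.lower()).strip()

def find_running_headers(pages, min_fraction=0.5):
    """Return normalized first lines that recur on at least min_fraction of pages."""
    firsts = Counter()
    for page in pages:
        if page:
            key = normalize(page[0])
            if key:
                firsts[key] += 1
    threshold = min_fraction * len(pages)
    return {key for key, n in firsts.items() if n >= threshold}

def strip_running_headers(pages, headers):
    """Drop the first line of every page whose normalized form is a known header."""
    return [page[1:] if page and normalize(page[0]) in headers else page
            for page in pages]

# Hypothetical sample: each page is a list of lines.
pages = [
    ["THE MONTHLY REVIEW. 12", "It was a dark night."],
    ["THE MONTHLY REVIEW. 13", "The rain fell."],
    ["CHAPTER II.", "Morning came."],
    ["THE MONTHLY REVIEW. 15", "He rose early."],
]
headers = find_running_headers(pages)
cleaned = strip_running_headers(pages, headers)
print(headers)  # {'the monthly review'}
```

Keeping `headers` around before stripping lets the header text be recorded as evidence about what part of the volume a page belongs to.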

Page 11: The task of cleaning  and enriching  large collections
Page 12: The task of cleaning  and enriching  large collections

period-specific lexica incl. foreign lang.

collection-level observations:

proper nouns, words that appear mainly in dirty docs

context of an individual doc:

It is furely a mortal fin to ...

Correction rules
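The long-s example above ("furely", "fin") suggests how lexicon evidence and document context might combine. The following is a minimal sketch under loud assumptions: tokens are already lowercased, `freq` is a hypothetical period lexicon with made-up frequencies, and only the f/s confusion is handled. It is not the project's actual rule set.

```python
from itertools import combinations

def s_variants(token):
    """Return all spellings made by turning one or more 'f's into 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    variants = []
    for r in range(1, len(positions) + 1):
        for subset in combinations(positions, r):
            chars = list(token)
            for i in subset:
                chars[i] = "s"
            variants.append("".join(chars))
    return variants

def best_s_variant(token, freq):
    """Most frequent s-variant found in the lexicon, or None."""
    known = [v for v in s_variants(token) if v in freq]
    return max(known, key=freq.get) if known else None

def correct_document(tokens, freq):
    """Correct long-s errors in a list of lowercased tokens."""
    # Pass 1: a token missing from the lexicon whose s-variant is present
    # is unambiguous evidence that this document contains long-s errors.
    dirty = any(t not in freq and best_s_variant(t, freq) for t in tokens)
    corrected = []
    for t in tokens:
        variant = best_s_variant(t, freq)
        if variant is None:
            corrected.append(t)
        elif t not in freq:
            corrected.append(variant)   # unambiguous error
        elif dirty and freq[variant] > freq[t]:
            corrected.append(variant)   # real word, but context says error
        else:
            corrected.append(t)
    return corrected

# Hypothetical lexicon with invented frequencies.
freq = {"it": 900, "is": 800, "surely": 40, "a": 1000,
        "mortal": 30, "sin": 60, "fin": 2, "to": 950}
fixed = correct_document("it is furely a mortal fin to".split(), freq)
print(" ".join(fixed))  # it is surely a mortal sin to
```

Note that "fin" is a legitimate word; it is changed here only because "furely" marks the document as dirty. In a document with no such evidence, `correct_document(["a", "fin"], freq)` leaves "fin" alone. A frequency comparison this crude would overcorrect some real words, which is where collection-level observations and gazetteers of proper nouns would come in.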

Page 13: The task of cleaning  and enriching  large collections

Cleaning/enriching the metadata

1. “18??”
2. Discard duplicate volumes / select early editions?
3. Add metadata that you need for interpretive purposes, like
   • gender (see Ben Schmidt’s technique),
   • genre.
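The first two steps can be sketched as follows. Everything here is an assumption for illustration: the record fields (`author`, `title`, `date`), the normalization, and the policy of ranking an uncertain date by its latest possible year are hypothetical choices, not the project's actual procedure.

```python
import re

def parse_date(raw):
    """Turn a catalog date such as '1843', '184?' or '18??' into (earliest, latest).

    Returns None when no four-character year pattern is found.
    """
    match = re.search(r"\d[\d?]{3}", raw)
    if not match:
        return None
    text = match.group(0)
    return int(text.replace("?", "0")), int(text.replace("?", "9"))

def normalize_key(author, title):
    """Crude matching key: lowercase, punctuation collapsed to single spaces."""
    def squash(s):
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return squash(author), squash(title)

def earliest_editions(records):
    """Keep one record per (author, title), preferring the earliest firm date."""
    best = {}
    for rec in records:
        span = parse_date(rec["date"])
        if span is None:
            continue  # undatable records are simply dropped in this sketch
        key = normalize_key(rec["author"], rec["title"])
        # Rank by the latest possible year, so a vague '18??' cannot
        # displace a firmly dated early edition.
        if key not in best or span[1] < best[key][0][1]:
            best[key] = (span, rec)
    return [rec for _, rec in best.values()]

# Hypothetical catalog records.
records = [
    {"author": "Austen, Jane", "title": "Emma.", "date": "1816"},
    {"author": "Austen, Jane", "title": "Emma", "date": "18??"},
    {"author": "Austen, Jane", "title": "EMMA", "date": "1833"},
    {"author": "Scott, Walter", "title": "Waverley", "date": "[1814]"},
]
kept = earliest_editions(records)
print(parse_date("18??"))         # (1800, 1899)
print([r["date"] for r in kept])  # ['1816', '[1814]']
```

Exact-match keys like these miss variant titles and multi-volume sets, so real deduplication needs fuzzier matching.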

Page 14: The task of cleaning  and enriching  large collections

first stab at genre – naive Bayes
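For readers unfamiliar with the technique, here is a minimal multinomial naive Bayes classifier with add-one smoothing, written in plain Python. It is a generic textbook sketch, not the classifier used in this project; the class name, training sentences, and labels are invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of docs
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        """Unnormalized log posterior for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocab:  # unseen words are ignored
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = [
    "she said softly and he smiled".split(),
    "he said nothing and she wept".split(),
    "the parliament passed the act".split(),
    "the treaty was signed in paris".split(),
]
labels = ["fiction", "fiction", "nonfiction", "nonfiction"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("she smiled and said".split()))  # fiction
print(model.predict("the act was signed".split()))   # nonfiction
```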

Page 15: The task of cleaning  and enriching  large collections

Things we could share

  • period lexicons / variant spellings
  • gazetteers of proper nouns
  • OCR correction rules for a period
  • document segmentation and/or cleaned and segmented text
  • ferberization
  • cleaned / enriched metadata
  • code to do all of the above

Page 16: The task of cleaning  and enriching  large collections

get clues from metadata

break vols into parts

ensemble / boosting

active learning

Page 17: The task of cleaning  and enriching  large collections

active learning: documents classified as “fiction,” plotted by confidence in classification (y axis). Red points are misclassified.
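One common form of active learning is uncertainty sampling: ask a human to label the documents the classifier is least sure about, then retrain. The selection step might look like the sketch below, which assumes the classifier exposes per-label probabilities; the function name, volume ids, and probabilities are hypothetical.

```python
def least_confident(probabilities, k=2):
    """Return ids of the k documents whose top-class probability is lowest.

    `probabilities` maps a document id to {label: probability}.
    """
    def confidence(doc_id):
        return max(probabilities[doc_id].values())
    return sorted(probabilities, key=confidence)[:k]

# Hypothetical classifier output for four volumes.
probs = {
    "vol_a": {"fiction": 0.98, "nonfiction": 0.02},
    "vol_b": {"fiction": 0.55, "nonfiction": 0.45},
    "vol_c": {"fiction": 0.70, "nonfiction": 0.30},
    "vol_d": {"fiction": 0.51, "nonfiction": 0.49},
}
to_label = least_confident(probs, k=2)
print(to_label)  # ['vol_d', 'vol_b']
```

The volumes returned would be hand-labeled and added to the training set before the next round of classification.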