Geri Ingram
Community Manager
OCLC CONTENTdm® Working with Text and PDFs
Spring 2015 CONTENTdm User Conference
Goucher College
Baltimore, MD
May 27, 2015
The world’s libraries. Connected.
⢠To get the most out of this session
⢠Either you have:
⢠Experience building CONTENTdm collections
OR
⢠Attended recent CONTENTdm Training
Intended audience
⢠Context-setting⢠Data
⢠Filetypes
⢠Organization
⢠Naming
⢠Collection Configuration⢠Adding text-rich digital items
⢠PDFs, singly and in batch⢠Image files âMonographs (singly), Docs, Postcards singly and
in batch
⢠QA
Agenda
Context: Your Users, Your Collections, Discovery & Delivery
⢠Primary user storiesâwho are your users?
⢠What are the text-rich materials you have?
⢠YOUR users needs drive decisions about content and access methods
⢠YOUR data drive decisions about collection building methods
⢠Which tools are appropriate for your materials?
⢠What do different wizards expect by way of file naming and organization?
Context: Your users, Your materials
What are your users looking for?
Yearbooks, newspapers, ETDs…
Historical postcards…
Archival papers…
⢠Browsing
⢠Searching--across collections, subgroups?⢠Known item searching, and/or
⢠Total recall by topic, name, etc.
⢠Your users expect searchable, text-rich materials⢠full-text search-ability across the repository, the site, the world.
How are they looking for these materials?
⢠Metadata is key for discovery⢠All fields may be made searchable
⢠Full-text is quickly becoming necessary for discovery as well as delivery
⢠Linked data will eventually provide the engine for new knowledge creation
Curated AND indexed
Meet users where they are
⢠Simple guidelines to help⢠Website Config tools
⢠Primary URL
⢠Automated site maps
⢠Moving toward linked data ⢠Built from standard vocabularies
⢠Search engines ⢠Sitemaps, persistent identifiers, etc.
⢠Schema.org
⢠Becoming visible in the web-of-things⢠Harvesting metadata
⢠WorldCat
⢠DPLA
Visibility
Data: File types, Organization, Naming
⢠Papers, videos, audio files
⢠In CONTENTdm, these are natively simple items, not compound objects, e.g.:
⢠.pdf
⢠.mp3
⢠.avi
Formats: Born-digital
⢠If still to be digitized⢠You have control over the project specification
⢠File name and organization
⢠Metadata automatically and manually created
⢠If already digitized⢠You choose among the tools for the one that best fits your data
organization
Formats: Digitized (reformatted)
⢠De facto standard for documents⢠Can have embedded text
⢠Portable
⢠Responsive design must include a PDF reader for market demands
⢠Preservation metadata formats
⢠A simple format, can ingest through CONTENTdm Web add as well as through Project Client.
Adobe PDF
⢠CONTENTdm defined classes âwhen 2 or more simple items are bound together by logic (and XML):
⢠Documentsââflatââa series of related items
⢠Postcardsâexactly two digital files; two-sided items
⢠Monographsââhierarchicalââitems related in a hierarchy
⢠Six-sided viewsâexactly six digital files (known as âpicture cubeâ)
CONTENTdm Compound Objects
⢠Remember: metadata fields can be made searchable or not
⢠In addition, full-text, extracted from the digital object itself can be stored in a metadata field in any of three ways:
1. Generated by OCR âon-the-flyâ (integrated ABBYY FineReaderÂŽ)
2. Imported as .txt transcript
⢠Typescripted from handwritten manuscripts
or
⢠OCRâd in advance (external OCR engine)
3. Extracted (by server) from PDFs (if text has been created from the image to begin with)
Providing searchable text from image files
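When transcripts are imported as .txt files, it can help to confirm before import that every scan has one. The sketch below is only an illustration: it assumes each transcript sits in the same folder as its image and shares the image's base name, and the list of image extensions is a guess. Check the naming and folder layout your wizard actually expects before relying on it.

```python
from pathlib import Path

# Assumed set of scan file extensions; adjust to match your project.
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".jp2", ".png"}


def find_missing_transcripts(scan_dir):
    """Return names of image files in scan_dir that lack a same-named .txt file.

    Assumption: a transcript for page001.tif would be page001.txt in the
    same folder. This pairing rule is hypothetical, not taken from the slides.
    """
    scan_dir = Path(scan_dir)
    missing = []
    for image in sorted(scan_dir.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            if not image.with_suffix(".txt").exists():
                missing.append(image.name)
    return missing
```

Running this over a scan folder gives a list of pages that would need OCR or manual transcription before they become full-text searchable.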
Configuration: Collection, Project, & Website
Configure Collection and Project
1. Examine the folders and files
2. Access the collection administration page
3. Configure PDF conversion option
4. Add appropriate metadata fields for your data
   1. A searchable field with a Full text search data type
   2. A Tag field that is searchable and hidden
5. Use metadata templates as much as possible
   1. Page level
   2. Compound object level
   3. PDFs
⢠If your materials have searchable text, you will need
Collection field:
One empty, searchable field configured as âFull text searchâ data type to hold text
⢠For âtopâ level records only ⢠Website Config tool:
⢠to suppress display of components of compound objects in search results.
⢠Export via OAI-PMH
Collection and Website Configuration choices
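For readers unfamiliar with OAI-PMH export, a harvester retrieves records by sending an HTTP request with a `verb` and a `metadataPrefix`, optionally limited to one set. The sketch below only builds such a request URL using the standard protocol parameters. The base URL and set name in the usage note are placeholders, not real CONTENTdm endpoints.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    base_url and set_spec are supplied by the caller; the values used in
    any example are hypothetical. oai_dc is the Dublin Core prefix that
    every OAI-PMH repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"
```

For example, `build_list_records_url("https://example.org/oai/oai.php", "yearbooks")` would produce a URL asking for all Dublin Core records in a set named `yearbooks`.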
Website Config options for PDF display
Adding text-rich materials
PDFs, Images
⢠Collection configuration option:⢠âConvert PDF to compound objectâ
⢠What it does and does NOT do.
⢠How/when you might override it
⢠Effect on the end-usersâ view⢠A Multi-page PDF will call compound object viewer
If it has been processed as if it were a compound object
⢠A one-page PDF will ignore the setting and call the item viewer to display
Processing a .pdf to optimize indexing, search and display
RememberâPDFs are simple files that can be converted to compound objectsâstill counted AND added, as simple items
⢠It DOES allow very large pdf files to be indexed, searched and retrieved quicklyâEACH page can have 128,000 characters.
⢠It DOES allow end users to search for text across huge volumes of materials.
⢠It DOES allow the end-user to choose to view the PDF by thumbnail, by contents, with View PDF and text*, or through Page flip.
⢠It does NOT allow you to ânestâ compound objects. I.e., you can assemble multiple PDFs as a compound object, but you cannot then take advantage of the page-level indexing, display etc., within each âpageâ of the compound object.
What PDF conversion does and does NOT do.
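If you already have page-level text in hand (for example, externally produced transcripts), the per-page figure of 128,000 characters can be checked before import. This is a minimal sketch that treats that figure as a simple character count; how CONTENTdm itself counts or handles longer text is not stated here, so the results are a warning to investigate, not a verdict.

```python
# Per-page character figure quoted on the slide; treated here as a plain
# count of Python characters, which is an assumption.
PAGE_TEXT_LIMIT = 128_000


def pages_over_limit(page_texts, limit=PAGE_TEXT_LIMIT):
    """Return 1-based page numbers whose text is longer than limit characters."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text) > limit
    ]
```

An empty list means every page fits within the quoted figure.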
PDF importing
Explain and Demonstrate:
⢠All from one folder
⢠Several from same folder
Add items, multiple items
Images: Compound Objects
Explain and Demonstrate:
⢠Definitions: ⢠Compound Objectâseries of 2 or more items assembled together
⢠WizardââAdd Compound Objectâ
⢠(single) or Multiple ( either Object List or Directory Structure)
⢠Type of Object (Document, Monograph, Post Card, Picture Cube)
⢠Method leverages data in hand (with or without tab-delimited metadata)
⢠Materials commonly assembled as compound objects, e.g., ⢠Yearbooks, Papers, Postcards, Books
File organization and naming
Preparing to use the Project Client wizards
⢠For all compound object types⢠Document, Monograph, Postcard, Picture cube
⢠For each compound object
⢠All digital files must reside in one directory/folder
⢠This is true whether you are adding multiple compound objects or a single compound object.
⢠And with multiple objects, all must be of same type
File and folder facts:regardless of wizard to be used in Add compound object function, SCANS are held together in one folder
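The one-folder rule for scans can be checked with a short script before running a wizard. The sketch below is an illustration under stated assumptions: it takes one compound object's folder and lists any scan files sitting in nested subfolders. The set of scan extensions is a guess, and non-scan files in subfolders are deliberately ignored.

```python
from pathlib import Path

# Assumed scan file extensions; adjust to match your digitization spec.
SCAN_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".jp2"}


def scans_outside_top_level(object_dir):
    """List scans found below the top level of one compound object's folder.

    Returns paths relative to object_dir. An empty list means all scans
    are held together in the one folder.
    """
    object_dir = Path(object_dir)
    strays = []
    for path in sorted(object_dir.rglob("*")):
        is_scan = path.is_file() and path.suffix.lower() in SCAN_SUFFIXES
        if is_scan and path.parent != object_dir:
            strays.append(str(path.relative_to(object_dir)))
    return strays
```

Any path it returns points to a scan that would need to be moved up into the object's folder.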
Example: a single Document using Add compound object
Structure by folder organization
Example: a single Monograph using Add compound object
Where structured by folder organization
Where structured by a tab-delimited text file
When you add multiple compound objects using tab-delimited files: their nature and placement change
Got page-level metadata? Each object needs its own.
Got only object-level metadata? All objects share one.
Compound objects using tab-delimited files, depending upon the Object class: the structure of the .txt file itself changes
Document: all “columns” are field attributes
Monograph: two new “columns” define structure
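The difference between the two file shapes can be sketched by writing a tab-delimited file from a list of rows. Everything named here is illustrative: the field names and the two structure column names are placeholders, not the headings CONTENTdm actually requires, so consult the help files for the real ones.

```python
import csv
import io


def to_tab_delimited(columns, rows):
    """Render rows (dicts keyed by column name) as tab-delimited text.

    Missing values are written as empty cells so every line has the same
    number of columns.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row.get(column, "") for column in columns])
    return buffer.getvalue()


# Hypothetical column sets, for illustration only.
DOCUMENT_COLUMNS = ["Title", "Date", "File Name"]
# A monograph file adds two structure columns; these names are placeholders.
MONOGRAPH_COLUMNS = ["Structure A", "Structure B"] + DOCUMENT_COLUMNS
```

Calling `to_tab_delimited(DOCUMENT_COLUMNS, rows)` and `to_tab_delimited(MONOGRAPH_COLUMNS, rows)` on the same rows shows the extra leading columns a monograph file would carry.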
⢠Monographsâwith two methods for structure and transcripts⢠Will you OCR to get transcript?
⢠Method: directory (folder) structure:
⢠Yearbook (DEMO only OCRâd transcripts produced on the fly)
⢠Do you have transcripts in .txt file for each image?
⢠Method: tab-d file structure
⢠Book (Separate transcript externally produced)
Demonstrate: (single) Monograph Compound objectWizard: Compound Object
⢠Documentsâ⢠Using Object List with .txt for each Scan directory
⢠Typescript Letters with page-level metadata IN .txt files (Horowitz archives)
⢠Postcardsâ⢠Using Directory structure
⢠Hand-written â some with typescripts, some without
Demonstrate: Compound ObjectsTwo wizardsâeach leverages the data
⢠Administration/Editâone-at-a-time structure, obj-level metadata
⢠Project Client/Find in Collection (search or browse)⢠Batch or single, all structure and metadata can be edited
⢠Edit in one or more ways, depending upon the data⢠Move, replace, delete, add pages (only âtrueâ compound
objects)
⢠Edit a transcript; find and replace across pages
⢠Save, upload, re-approve and re-index
Demonstrate:Edit Compound Objects
⢠Getting help with compound objects⢠User Support Center
⢠Tutorials to study
⢠Installing, activating the OCR extension
⢠Help files related to text works
⢠Office hours twice monthly
⢠Write [email protected]
Questions & Answers
⢠Geri Ingram
Questions?