Digital Libraries
Prof. Marcos Andre Goncalves
Universidade Federal de Minas Gerais
Synchronous Scholarly Communication
Same time, Same or different place
Asynchronous, Digital Library Mediated Scholarly Communication
Different time and/or place
Digital Libraries Shorten the Chain from
Editor
Publisher
A&I
Library
Reviewer
DLs Shorten the Chain to
Author
Reader
Digital Library
Editor
Reviewer
Teacher
Learner
Librarian
DL Overview
Why of Global Interest?
• National projects can preserve antiquities and heritage: cultural, historical, linguistic, scholarly
• Knowledge and information are essential to economic and technological growth, education
• DL - a domain for international collaboration
– wherein all can contribute and benefit
– which leverages investment in networking
– which provides useful content on Internet & WWW
– which will tie nations and peoples together more strongly and through deeper understanding
Digital Libraries --- Objectives
• World Lit.: 24hr / 7day / from desktop
• Integrated “super” information systems: 5S: Table of related areas and their coverage
• Ubiquitous, Higher Quality, Lower Cost
• Education, Knowledge Sharing, Discovery
• Disintermediation -> Collaboration
• Universities Reclaim Property
• Interactive Courseware, Student Works
• Scalable, Sustainable, Usable, Useful
How is a DL different from a database?
• A traditional SQL database has as its basic element data items in a relation:
– select name
– from employee, project
– where employee.deptnumber = “25” AND
– project.number = “100”
• databases exploit known structures and relations
• DBMS retrieval is not probabilistic (Frakes, Baeza-Yates, p. 3)
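A minimal sketch of the exact-match behavior described above, using Python's built-in sqlite3 module. Only the query comes from the slide (written here with standard single-quoted literals and the column qualified as employee.name); the table layouts and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory database; the schema and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, deptnumber TEXT);
    CREATE TABLE project  (number TEXT, title TEXT);
    INSERT INTO employee VALUES ('Ana', '25'), ('Bruno', '25'), ('Carla', '30');
    INSERT INTO project  VALUES ('100', 'Survey'), ('200', 'Archive');
""")

# The query from the slide. It has no join condition, so each employee in
# department 25 is simply paired with the single project numbered 100.
rows = conn.execute("""
    SELECT employee.name
    FROM employee, project
    WHERE employee.deptnumber = '25' AND project.number = '100'
    ORDER BY employee.name
""").fetchall()

names = [row[0] for row in rows]
print(names)
```

Each row either satisfies the predicate or it does not; there is no score or ranking, which is the sense in which DBMS retrieval is not probabilistic.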
How is a DL different from the WWW?
• The keyword is managed
– The WWW is not managed
• Some meta searchers (Yahoo, Lycos) attempt to add an organizational framework to their web holdings
– However, most are focused on keyword searching (e.g., Google)
How is a DL different from the WWW?
• Another key difference is who controls the input into the system
– most meta searchers hunt down their holdings
• Lycos is short for Lycosidae lycosa (the “wolf spider”), which pursues its prey and does not build a web (Mauldin, IEEE Expert, 1/97)
– some (Yahoo) have humans in the loop for review and classification
• To date, DLs are generally more tightly controlled, and have a targeted customer set
DL = Content + Services
• “Why not just use the WWW?”
– WWW by itself has low archival & management characteristics
• “Why not use a RDBMS?”
– In the same way that a card catalog is not a TL, a RDBMS is not a DL; it is a candidate technology for use in DLs
• DL is the union of the content and services defined on the content
[Figure: layered DL architecture]
Access: WWW (http) access (most common); non-WWW access (now uncommon)
Digital Library Services (searching, browsing, citation analysis, usage analysis, alerts)
Technologies: vector and/or Boolean search engines (traditional IR); RDBMS; file systems; other technologies
Content
How is a DL Different from a Traditional Library?
• TL has as its focus physical objects
– even if the card catalog (metadata) is electronic, the purpose is to point you to a physical location
– trafficking in physical objects has both obvious and subtle implications
• object can exist only in 1 place
• if you have it, I can’t have it (zero-sum distribution)
• I have to go to the object, or wait for it to come to me
TLs vs. DLs
• DLs clearly better than TLs at:
– Dissemination, storing information variety
• However, TL objects are more survivable
– Who will archive the research information?
• the publishers?
• the institutions?
• the authors?
– Will the average DL object still be accessible in 10 years?
• take my digital preservation seminar in the spring!
image from: http://www.ancientegypt.co.uk/writing/rosetta.html
How is a DL Different from a Traditional Library?
• Digital Library
– removing the physical restriction has obvious benefits
• multiple access, multiple listings, electronic transmission
– also complicates many other issues...
• intellectual property, terms and conditions, etc.
• Note that a TL offers additional social and educational benefits
– Most TLs also offer hybrid services.
DL Definitions - 1
• “A digital library is an organized and focused collection of digital objects, including text, images, video, and audio, along with methods of access and retrieval, and for selection, creation, organization, maintenance, and sharing of the collection.”
• Witten & Bainbridge – “How to Build a Digital Library” – Morgan Kaufmann 2003
DL Definitions - 2
• “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities”
• Waters, D.J. CLIR Issues, July/August 1998
• www.clir.org/pubs/issues/issues04.html
Informal 5S & DL Definitions
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)
5Ss
Ss | Examples | Objectives
Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
Structures | Collection; catalog; hypertext; document; metadata | Specifies organizational aspects of the DL content
Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components
Scenarios | Searching, browsing, recommending | Details the behavior of DL services
Societies | Service managers, learners, teachers, etc. | Defines managers, responsible for running DL services; actors, that use those services; and relationships among them
5S and DL formal definitions and compositions (April 2004 TOIS)
[Figure: concept map of the 5S formal definitions and how they compose, with definition numbers as shown on the slide]
Mathematical preliminaries: relation (d.1), function (d.2), sequence (d.3), tuple (d.4)*, language (d.5), graph (d.6), grammar (d.7)
5S: streams (d.9), structures (d.10), spaces (d.18), scenarios (d.21), societies (d.24)
Spaces: measurable (d.12), measure (d.13), probability (d.14), vector (d.15), topological (d.16)
Supporting concepts: event (d.10), state (d.18), services (d.22), transmission (d.23)
Composite concepts: structural metadata specification (d.25), descriptive metadata specification (d.26), structured stream (d.29), digital object (d.30), collection (d.31), metadata catalog (d.32), repository (d.33), indexing service (d.34), searching service (d.35), hypertext (d.36), browsing service (d.37)
digital library (minimal) (d.38)
ETANA-DL
• Archaeological DL
• Integrated DL
– Heterogeneous data handling
• Applies and extends the OAI-PMH
– Open Archives Initiative Protocol for Metadata Harvesting
• Design considerations
– Componentized
– Extensible
– Portable
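As a rough illustration of what harvesting over OAI-PMH involves, the sketch below builds a ListRecords request URL. The verb and the metadataPrefix, from, and set arguments are part of the protocol; the base URL and the helper name list_records_url are hypothetical, and nothing here is taken from the ETANA-DL implementation.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real repository publishes its own.
BASE_URL = "http://example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec     # selective harvesting by set
    return base_url + "?" + urlencode(params)

url = list_records_url(BASE_URL, from_date="2004-01-01")
print(url)
```

A harvester would issue this request over HTTP, parse the XML response, and repeat with a resumptionToken while the repository returns one.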
Map courtesy: www.enchantedlearning.com
Initial ETANA-DL Member Locations
Virginia Tech
Mississippi State University
Vanderbilt University
Canadian University College
Walla Walla College
Andrews University
CWRU
Willamette University
Lahav Website
Megiddo Opening Screen
Locus Screen: Pictures
View all
Area Screen
ETANA-DL Approach
• Applying and extending Digital Library (DL) techniques to solve key problems: making primary data available, data preservation, and interoperability
• Modeling archaeological information systems using 5S to better understand the domain and design the system and the supporting services
• Rapidly prototyping DLs that handle heterogeneous archaeological data using componentized frameworks:
– eliciting requirements
– refining metamodel and union schema
– modeling sites
– mapping
– harvesting
– providing useful services
ETANA-DL Website
Marking Items
Marking: writing notes for a specific user
Marked Items Display
Sender, Date, Object OAI ID
Sender Comments
Options: View Record, Add record to Items Of Interest, Re-mark item (Redirect), Unmark item (Remove item from list)
Discussions Page
Discussions about an object
View/Post messages, create new threads
Recommendations
Items recommended on the basis of similar interests
ETANA-DL Searching Service
ETANA-DL Multi-dimensional Browsing
3 new sites
2 new types of artifacts
ETANA-DL Visual Browsing Service
Visual Browse by site
Visual Browsing Nimrin: Topographical Drawings
Full site North west quadrant
Square:N40/W20
Visual Browsing Nimrin : Square information
Square:N40/W20
Locus: 86
Loci layout
Visual Browsing Nimrin : locus sheet
Visual Browsing Bab edh-Dhra'
Cemetery
Pottery # 25
ETANA Societies
1. Historic and pre-historic societies (being studied)
2. Archaeologists (in academic institutes, fieldwork settings, or local and national governmental bodies)
3. Project directors
4. Technical staff (consisting of photographers, technical illustrators, and their assistants)
5. Field staff (responsible for the actual work of excavation)
6. Camp staff (e.g., camp managers, registrars, tool stewards)
7. General public (e.g., educators, learners, citizens)
ETANA Societies
• Social issues
1. Who owns the finds?
2. Where should they be preserved?
3. What nationality and ethnicity do they represent?
4. Who has publication rights?
5. What interactions took place between those at the site studied, and others? What theories are proposed by whom about this?
ETANA Scenarios
1. Life in the site in former times
2. Digital recording: the planning stage and the excavation stage
3. Planning stage: remote sensing, fieldwalking, field surveys, building surveys, consulting historical and other documentary sources, and managing the sites and monuments
4. Excavation
   1. Detailed information is recorded, including for each layer of soil, and for features such as pole holes, pits, and ditches.
   2. Data about each artifact is recorded together with information about its exact find spot.
   3. Numerous environmental and other samples are taken for laboratory analysis, and the location and purpose of each is carefully recorded.
   4. Large numbers of photographs are taken, both general views of the progress of excavation and detailed shots showing the contexts of finds.
5. Organization and storage of material
6. Analysis and hypotheses generation and testing
7. Publications, museum displays
8. Information services for the general public
ETANA Spaces
1. Geographic distribution of found artifacts
2. Temporal dimension (as inferred by archaeologists)
3. Metric or vector spaces
   1. used to support retrieval operations, and to calculate distance (and similarity)
   2. used to browse / constrain searches spatially
4. 3D models of the past, used to reconstruct and visualize archaeological ruins
5. 2D interfaces for human-computer interaction
ETANA Structures
1. Site Organization
   1. Region, site, partition, sub-partition, locus, …
2. Temporal orderings (ages, periods)
3. Taxonomies
   1. for bones, seeds, building materials, …
4. Stratigraphic relationships
   1. above, beneath, coexistent
ETANA Streams
1. successive photos and drawings of excavation sites, loci, unearthed artifacts
2. audio and video recordings of excavation activities and discussions
3. textual reports
4. 3D models used to reconstruct and visualize archaeological ruins.
Streams
• Multiple media types and representation
– See ch. 4 for IR (except some here for non-text)
– Standards for each, and for some combinations
• Text
– Character strings, encoding (Unicode)
– Morphology -> Stemming
– Syntax, semantics -> stop words
– ** POS tagging, phrases
• Images, Audio, Video, Graphics, Animation
– Capture, digitization, representation
– CBIR for each
• ** Compression, processing, analysis
• ** Synchronization, rendering, presentation, interchange
– RealVideo, SMIL, QoS
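The text steps listed above (character strings, stop words, stemming) can be sketched as a tiny pipeline. This is only an illustration: the stop list and suffix list are assumptions, and naive_stem is a crude suffix stripper, not a standard stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop list and suffix list; real systems use larger ones.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

terms = index_terms("Browsing the collections of digital libraries")
print(terms)  # ['brows', 'collection', 'digital', 'librari']
```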
Content Based Information Retrieval
Problems
• Image similarity is subjective
– Personal Interpretation
• Concept x Appearance
By Visual features
– Retrieve images with 50 percent of white colour and 50 percent of black colour
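A query by visual features such as the one above can be sketched as a comparison of colour proportions. The representation is an assumption made for illustration: an image is a flat list of RGB tuples, and the function names and tolerance value are hypothetical.

```python
def color_fractions(pixels):
    """Return the fraction of pixels that have each colour value."""
    if not pixels:
        return {}
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    return {color: n / len(pixels) for color, n in counts.items()}

def matches(pixels, target, tolerance=0.05):
    """True if every colour's share is within `tolerance` of the target share."""
    fractions = color_fractions(pixels)
    colors = set(fractions) | set(target)
    return all(
        abs(fractions.get(c, 0.0) - target.get(c, 0.0)) <= tolerance
        for c in colors
    )

WHITE, BLACK, RED = (255, 255, 255), (0, 0, 0), (255, 0, 0)
checkerboard = [WHITE, BLACK] * 8          # half white, half black
mostly_red = [RED] * 12 + [WHITE] * 4      # 75 percent red

query = {WHITE: 0.5, BLACK: 0.5}
print(matches(checkerboard, query))  # True
print(matches(mostly_red, query))    # False
```

The sketch matches on appearance only, which is the gap between concept and appearance noted under Problems: two images with the same colour proportions need not show the same thing.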
Textual information retrieval
Query on Google using Sunset and Rio de Janeiro
Query result
Image Classification by shape
Work of Torres et al.
• Search in collections of fish images using a combination of image properties (CBIR) and textual descriptions
Motivation
• Query 1:
– List all metadata related to fish which were observed in the Amazon River
• Query 2:
– Retrieve images of fishes whose shape is similar to that in the example
• Query 3:
– List all metadata related to fishes that were observed in the Amazon River and whose shape is similar to that in the example
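Query 3 combines a metadata filter with a content-based ranking. A minimal sketch under stated assumptions: FishRecord, the two-number shape descriptor, and the sample records are all hypothetical, and Euclidean distance stands in for whatever shape similarity measure a real system would use.

```python
import math
from dataclasses import dataclass

@dataclass
class FishRecord:
    species: str
    river: str
    shape: tuple  # hypothetical shape descriptor extracted from the image

def shape_distance(a, b):
    """Euclidean distance between two shape descriptors (smaller is more similar)."""
    return math.dist(a, b)

def query3(records, river, example_shape, k=2):
    """Filter by metadata (river), then rank survivors by shape similarity."""
    candidates = [r for r in records if r.river == river]
    candidates.sort(key=lambda r: shape_distance(r.shape, example_shape))
    return candidates[:k]

records = [
    FishRecord("species A", "Amazon River", (0.9, 0.1)),
    FishRecord("species B", "Amazon River", (0.2, 0.8)),
    FishRecord("species C", "Tennessee River", (0.9, 0.1)),
]

result = query3(records, "Amazon River", example_shape=(1.0, 0.0), k=1)
print([r.species for r in result])  # ['species A']
```

Query 1 corresponds to the filter step alone and Query 2 to the ranking step alone.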
Motivation
• Retrieve fish descriptions whose shapes are similar to the one shown below, that belong to the “Notropis” genus, that have “large eyes”, and that have been observed in the “Tennessee River”
Problem
• There is no Biodiversity Information System which allows queries involving:
– Geographic data
– Species metadata
– Image Descriptors
• Existing systems:
– Metadata or
– Metadata + spatial data
– Images are stored as separate files
• With no possibility of retrieval by content
WeBioS
Torres: Visualizations
Spiral Pattern
Concentric Rings Pattern
Structures
• Digital Objects
– Documents, digitization, packaging (METS), interchange, standards, format conversion
– Genre: plays, encyclopedia, dictionaries, educational resources: courses (e.g., syllabi) and lessons
– Structural organizations (books, chapters, sections), excerpts/spans (mark, superimposed info)
• Metadata: standards, markup
• Knowledge Structures & Representations
– Databases, Schema, Ontologies, Thesauri, Lexicons, Authority files, Concept maps, Semantic networks
• Indexes
– Inverted files, signature files, R-trees, Quad trees, etc.
• Clusters & Classification Schemes
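Of the index structures listed above, the inverted file is the simplest to sketch: it maps each term to the set of documents containing it. The documents and function names below are assumptions for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the sorted ids of documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "pottery from the cemetery",
    2: "pottery sherds",
    3: "cemetery map",
}
index = build_inverted_index(docs)
print(boolean_and(index, "pottery", "cemetery"))  # [1]
```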
Degree of Structure
Chaotic | Organized | Structured
Web | DLs | DBs
Digital Objects (DOs)
• Born digital
• Digitized version of “real” object
– Is the DO version the same, better, or worse?
– Decision for ETDs: structured + rendered
• Surrogate for “real” object
– Not covered explicitly in metamodel for a minimal DL
– Crucial in metamodel for archaeology DL
Metadata Objects (MDOs)
• MARC
• Dublin Core
• RDF
• IMS
• OAI (Open Archives Initiative)
• Crosswalks, mappings
• Ontologies
• Topic maps, concept maps
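A crosswalk between two metadata formats can be sketched as a field mapping. The example below is an assumption-heavy illustration: the MARC record is reduced to a dict of tag to values (subfields are ignored), the sample values are invented, and MARC_TO_DC is a small illustrative subset, not an authoritative crosswalk.

```python
# Illustrative field mapping only, not an authoritative crosswalk.
MARC_TO_DC = {
    "245": "title",
    "650": "subject",
    "260": "publisher",
}

def crosswalk(marc_record):
    """Convert a simplified MARC record (tag -> list of values) to Dublin Core."""
    dc = {}
    for tag, values in marc_record.items():
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue  # fields without a mapping are dropped, so the conversion is lossy
        dc.setdefault(element, []).extend(values)
    return dc

# Hypothetical record; tag 300 has no entry in the mapping above.
marc = {
    "245": ["An Example Thesis"],
    "650": ["Digital libraries"],
    "300": ["1 volume"],
}
dc_record = crosswalk(marc)
print(dc_record)
```

Going from a richer format to a simpler one discards detail, which is why crosswalks are usually one-directional in practice.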
Complex to Simple
MARC ($50) Dublin Core (DC)
+thesis
Spaces
• Retrieval models
– Boolean, extended Boolean
– Vector, LSI
– Probabilistic: classical, belief network, inference network, language models
• User interfaces and visualization
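The vector model listed above ranks documents by the similarity between query and document vectors. A minimal sketch, assuming raw term counts as weights (real systems typically use tf-idf or similar) and whitespace tokenization; the documents are invented for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return ids of documents with nonzero similarity, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in scored if score > 0]

docs = {
    "d1": "digital library services",
    "d2": "library card catalog",
    "d3": "wolf spider",
}
ranking = rank("digital library", docs)
print(ranking)  # ['d1', 'd2']
```

Unlike Boolean retrieval, the result is an ordered list: d2 matches only one query term, so it is returned but ranked below d1.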
User interfaces and visualization
• 2D interfaces
• 3D interfaces
• GIS
• Other paradigms
Scenarios
• Recall OO for streams – now have objects as well as scenarios – e.g., interface components
• Information Access
– Searching: ad hoc, filtering/routing
– Browsing: using an organization, using a visualization, using links (i.e., hypertext, hypermedia)
– Workflow: sessions, feedback, etc.
• Scenario-based Design
• Usability: goals, tasks, claims
• NOTE: this is covered in the outline
Societies
• User communities
– Authors, editors, teachers, students, readers
– Personal(ization), group(ware), community, global
– Accessibility, universal access
• Librarians: reference, acquisition, operations
• Research community
– Associations, conferences, publications, labs, projects
• Economics
– Copyright, intellectual property rights, digital rights management, authorization, authentication, security, privacy, self-archiving (eprints)
– Publishers, catalogers, distributors, sustainability
– Open source, commercial, hybrid