11/15/2001 Information Organization and Retrieval Information Structures and Metadata University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Slides by Ray R. Larson and Robert Glushko


11/15/2001 Information Organization and Retrieval

Information Structures and Metadata

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Slides by Ray R. Larson and Robert Glushko


Review

• Controlled Vocabularies
  – Authority Control and Name Authorities
  – Subject Headings vs. Descriptors
  – Hierarchical vs. Facetted Organizations


Metadata

• Metadata is:
  – “data about data” (from Database)
  – Information about Information
  – Structures and Languages for the Description of Information Resources and their elements (components or features)
  – “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)


Types of Metadata systems and standards

• Naming and ID systems – URLs, ISBNs
• Bibliographic description – MARC, Dublin Core, TEI, etc.
• Music – SMDL
• Images and objects – CIMI, VRA Core Categories
• Numeric Data – DDI, SDSM
• Geospatial Data – FGDC
• Collections – EAD


Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.


The problem

• Proliferation of the forms of names
  – Different names for the same person
  – Different people with the same names
• Examples
  – from Books in Print (semi-controlled but not consistent)
  – ERIC author index (not controlled)


Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person


Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)


Name Authority Files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name
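
A minimal sketch of how such authority records might drive a name lookup. The tuple layout, the `build_index` and `lookup` helpers, and the choice to drop dates from the headings are all assumptions made for illustration; only the names themselves come from the records shown on these slides.

```python
from collections import defaultdict

# (authorized heading from field 100, see-from variants from field 400).
# Dates ($d) are dropped so that a searcher's bare name form matches;
# that simplification is an assumption of this sketch.
RECORDS = [
    ("Creasey, John", ["Cooke, M. E.", "Fecamps, Elise", "Wilde, Jimmy"]),
    ("Marric, J. J.", []),
    ("Butler, William Vivian", ["Butler, W. V.", "Marric, J. J."]),
]


def build_index(records):
    """Map every name form (heading or variant) to its authorized headings."""
    index = defaultdict(set)
    for heading, variants in records:
        index[heading].add(heading)
        for variant in variants:
            index[variant].add(heading)
    return index


def lookup(index, name):
    """Return the authorized headings that a searched name may refer to."""
    return sorted(index.get(name, set()))


index = build_index(RECORDS)
print(lookup(index, "Wilde, Jimmy"))    # one person behind a pseudonym
print(lookup(index, "Marric, J. J."))   # one name shared by two records
```

A name that resolves to more than one heading is exactly the "different people writing with the same name" case: the system can show the searcher both candidates instead of silently merging them.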


Other Types of Controlled Vocabularies

• Gazetteers (Geographic Names)

• Code lists (e.g. LC Language Codes)

• Subject Heading Lists

• Classification Schemes

• Thesauri


Today

• SGML

• XML

• DTDs

• Document Markup

• Uses of XML


SGML & XML

• What is SGML/XML?

• Document Type Definitions

• Document Markup

• Sources and Resources


What is SGML/XML?

• A. SGML stands for Standard Generalized Markup Language
  – XML stands for eXtensible Markup Language
• B. What it is NOT:
  – Not a visual document description
  – Not an application specific markup
  – Not proprietary


What is SGML/XML?

• What it is:
  – An international standard (SGML – ISO 8879:1986)
  – A generic language for describing the structure of documents, and markup that can be used for those documents
  – Intended for generating markup for content rather than form elements
• XML is a simplified subset of SGML (standardized by the W3C)


The Documents of Commerce

• Customer Profiles
• Vendor Profiles
• Catalogs
• Datasheets
• Price Lists
• Purchase Orders
• Invoices
• Inventory Reports
• Bill of Materials
• Contracts
• Credit Reports
• Bank Statements
• Proposals
• Directories
• Transportation Schedules
• Receipts

Source Dr. Robert J Glushko


Alternatives for Exchanging Documents

[Figure: three exchange models on a spectrum from format based to API based]

• HTML – Publish information for a universal client
• EDI – Batch and high-volume exchange between trading partners
• CORBA / COM – Application Integration

Source Dr. Robert J Glushko


Limitations of each Exchange Model

[Figure: HTML, EDI, and CORBA / COM on a spectrum from format based to API based, annotated with the limitations below]

• Formatting markup “for eyes”
• “Scrape and hope” integration
• Must be pre-arranged
• High cost
• Rigid and inflexible
• Pre-wired
• Heavyweight to implement
• Not native to the web

Source Dr. Robert J Glushko


Having our Cake and Eating it Too

We need:

• the precision of APIs
• the simplicity of HTML

Source Dr. Robert J Glushko


XML to the Rescue (SGML-- and HTML++)

• Extensible Markup Language
  – a simplification of SGML, the Standard Generalized Markup Language
  – instead of a fixed set of format-oriented tags like HTML, XML allows you to create the schema -- whatever set of tags are needed -- for your information type or application
  – this makes any XML instance “self-describing” and easily understood by computers and people

• Version 1.0 ratified by W3C in 2/98; backed by Microsoft, Sun, Netscape, many others

Source Dr. Robert J Glushko


Why XML is Revolutionary

• XML enables a business to preserve any “document type” or “database schema” when it publishes on the Web
• XML enables a business to send self-describing “business messages” that can be understood by programs, not just “by eye”
• This information cannot be encoded in HTML
• XML-encoded information is smart enough to support new classes of Web applications

Source Dr. Robert J Glushko


XML Enables New Web Applications

• Data interchange between Web clients
  – use Web for application integration without information loss (example: product information in supply chain, EDI)
• Moving processing from server to client
  – reduce network traffic and server load (example: download airline schedule, find best flights without “back-and-forth” thrashing)

Source Dr. Robert J Glushko


XML Enables New Web Applications

• Multiple client-side views of same data
  – expert and novice versions
  – manager and worker versions
  – localization (currency or measurement conversions)
• “Information push” from personalized applications
  – selecting information based on user preferences (example: custom news feed by matching article keywords against user profile)

Source Dr. Robert J Glushko


The First Generation Web

[Figure: Computers connected to Browsers; labels: scripts, HTML]

.. making information accessible through browsers

• Eyeballs only
• No automation
• Limited integration

Source Dr. Robert J Glushko


HTML Airline Schedule Seen “By Eye”

Airline Schedule

Flight Information

United Airlines #200

• San Francisco
• 9:30 AM
• Honolulu
• 12:30 PM
• $368.50

Source Dr. Robert J Glushko


HTML Airline Schedule Seen “By Computer”

<Title>Airline Schedule</Title>
<Body>
<H2>Flight Information</H2>
<H3>United Airlines #200</H3>
<UL>
<LI>San Francisco
<LI>9:30 AM
<LI>Honolulu
<LI>12:30 PM
<LI>$368.50
</UL>
</Body>

Source Dr. Robert J Glushko


Next Generation Web

[Figure: Computers connected to Computers; labels: Java, XML]

.. making information and services accessible to computers (and people)

• Structured searches
• Agents
• New models

Source Dr. Robert J Glushko


Airline Schedule in XML

<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>
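
To make "understood by programs" concrete, here is a small sketch that reads the schedule with Python's standard `xml.etree.ElementTree` module. The `read_segments` helper and the dictionary keys are hypothetical names chosen for this illustration; the element and attribute names are the ones on the slide.

```python
import xml.etree.ElementTree as ET

SCHEDULE_XML = """
<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>
"""


def read_segments(xml_text):
    """Turn each Segment element into a plain dictionary of named fields."""
    root = ET.fromstring(xml_text)
    segments = []
    for segment in root.findall("Segment"):
        price = segment.find("Price")
        segments.append({
            "id": segment.get("Id"),
            "origin": segment.findtext("Origin"),
            "destination": segment.findtext("Destination"),
            "depart": segment.findtext("DepartTime"),
            "arrive": segment.findtext("ArriveTime"),
            "price": float(price.text),
            "currency": price.get("Currency"),
        })
    return segments


segments = read_segments(SCHEDULE_XML)
print(segments[0]["destination"], segments[0]["price"], segments[0]["currency"])
```

The same extraction from the HTML version would have to guess that the third list item is the destination and the fifth is the price, which is the "scrape and hope" problem.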

Source Dr. Robert J Glushko


XML is a Foundation for Interoperability

[Figure: XML underlying WEB, EDI, and CORBA / COM across the spectrum from format based to API based]

.. exchange information in an application and vendor neutral format

Source Dr. Robert J Glushko


Open framework for commerce

[Figure: Common Business Language, with surrounding labels: Computer, Automotive, Pinnacles, HL/7, Procure, Retail, XML/EDI, OBI, OTP, SCOR, OFX, Health Care, Office, Consumer, Manufacturing, Supply Chain, Appliances]

• Shared Semantics
• Extensible and “aggressively interoperable”

Source Dr. Robert J Glushko


Shared Semantics for Time and Location

Shared semantics for location and time in all schemas that need them enables richer “commerce networks” of services:

<TransportSchedule Type="Airline"> ... <Destination>Honolulu</Destination>
<Accommodation Type="Hotel"> ... <Destination>Honolulu</Destination>
<Event Type="Concert"> ... <Destination>Honolulu</Destination>
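
A rough sketch of why a shared `Destination` element matters: one query can run across documents of different types. The four sample documents and the `services_for` helper are invented for this example (the slide elides the real content with "..."), and the second concert is added only to show a non-match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal instances of three different document types.
DOCUMENTS = [
    '<TransportSchedule Type="Airline"><Segment>'
    '<Destination>Honolulu</Destination></Segment></TransportSchedule>',
    '<Accommodation Type="Hotel"><Destination>Honolulu</Destination></Accommodation>',
    '<Event Type="Concert"><Destination>Honolulu</Destination></Event>',
    '<Event Type="Concert"><Destination>Berkeley</Destination></Event>',
]


def services_for(destination, documents):
    """Return (document type, Type attribute) for documents at a destination."""
    hits = []
    for text in documents:
        root = ET.fromstring(text)
        # iter() walks the whole tree, so nesting depth does not matter.
        if any(d.text == destination for d in root.iter("Destination")):
            hits.append((root.tag, root.get("Type")))
    return hits


honolulu = services_for("Honolulu", DOCUMENTS)
print(honolulu)
```

The query never needs to know the rest of each schema; it relies only on the element whose meaning the schemas share.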

Source Dr. Robert J Glushko


Automated Vacation Planning Service

• Book me the cheapest flight to Honolulu the first week of January

• Find a hotel room for the day I arrive

• What concerts are taking place the next day?

Source Dr. Robert J Glushko


The Common Business Language

• Specifies common semantics, common syntax, and message packaging for information held by and exchanged among transaction partners and market participants
• These documents are the interfaces among the commerce components envisioned in the overall eCo architecture being realized in a current ATP project being carried out by CNgroup, CommerceNet, BusinessBots, and Tesserae
• CBL’s focus is on the functions and information that are common to all business domains

Source Dr. Robert J Glushko


CBL and XML

• CBL documents are described by XML DTDs to make them “self-descriptive” and validatable
• CBL builds on existing standard or industry semantics where possible
• Complex descriptions and messages can be composed from primitives
• Domain-specific XML applications can be implemented in “native” form or as “hybrids” for maximal interoperability

Source Dr. Robert J Glushko


Common Business Language Building Blocks

CBL Documents

• Business Forms: Catalog, Purchase Order, Invoice
• Business Descriptions: Vendor, Services, Products
• Measurements: Time, Currency, Weight
• Locale: Address, Country, Language
• Classification: SIC, NAICS, FSC

[Figure: the label “core” appears five times]

Source Dr. Robert J Glushko



If Interested in CBL

• Visit:
  – http://www.xcbl.org/
• And for e-commerce applications using CBL, visit:
  – http://www.commerceone.com/


SGML/XML Definitions

• Defining DTDs

• Markup


SGML/XML Structure

• An SGML document consists of three parts:
  – The SGML Declaration
  – The Document Type Definition (DTD)
  – The Document Instance

• An XML document REQUIRES only the document instance, but for effective processing a DTD is very important

• XML Schema provides an alternative to DTDs for XML applications


Document Type Definitions

• The DTD describes the structural elements and "shorthand" markup for a particular document type. It defines:
  – Names of "legal" elements
  – How many times elements can appear
  – The order of elements in a document
  – Whether markup can be omitted (SGML only)
  – Contents of elements (i.e., nested structures)
  – Attributes associated with elements
  – Names of "entities"
  – Short-hand conventions for element tags (SGML only)


DTD Components

• The major components of a DTD are:
  – Entity Declarations
  – Element Declarations
  – Attribute Declarations


Document Type Definitions

• Entity Declarations are a "macro" definition facility for both DTD and Document instance parts.
  – General Internal Entity Definitions
    <!ENTITY name "substitute string">
    referenced by &name;
  – General External Entity Definitions
    <!ENTITY name SYSTEM "file path">
    referenced by &name;
  – Parameter Entity Definitions (used only inside DTDs)
    <!ENTITY % name "substitute string">
    or
    <!ENTITY % name SYSTEM "file path">
    referenced by %name; or %name (XML requires the closing semicolon)
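
As an illustration of a general internal entity acting as a macro, the sketch below declares one in an internal DTD subset and lets Python's standard `xml.etree.ElementTree` parser expand it. The `school` entity name and the memo content are made up for the example.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the entity; the instance references it.
MEMO = """<!DOCTYPE memo [
  <!ENTITY school "School of Information Management and Systems">
]>
<memo><from>&school;</from></memo>"""

root = ET.fromstring(MEMO)
sender = root.findtext("from")  # the parser has already substituted the text
print(sender)
```

External entities (the SYSTEM form) are a different matter: many parsers, this one included, do not fetch them by default, so the sketch sticks to the internal form.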


Document Type Definitions

• Element Declarations define the structural elements of a document and its associated markup.
  <!ELEMENT name - - content_model or declared_content +(include_list) -(exclude_list) >
  – Omitted tag minimization (the two indicators after the element name) states whether the start-tag and end-tag can be omitted in the markup (O) or must be present (-). The indicators are required in SGML but can NOT be used in XML.


Document Type Definitions

• Content model provides a nested structural description of the elements that make up this element, e.g.:
  <!ELEMENT memo - - ((to & from), body, close?)>
  <!ELEMENT body - O (p)* >
  <!ELEMENT p - O (#PCDATA | q)*>
  <!ELEMENT q - - (#PCDATA)>
  ...
  – ANY (in SGML) may be used to indicate a content model of any elements in the DTD, in any order.


Document Type Definitions

• Same Content model in XML
  <?xml version="1.0"?>
  <!DOCTYPE memo [
  <!ELEMENT memo ((to | from)+, body, close?)>
  <!ELEMENT body (p)* >
  <!ELEMENT p (#PCDATA | q)* >
  <!ELEMENT q (#PCDATA)>
  ...
  ]>
  – Note the XML Processing instruction “Prolog”
  – Note that the & connector on the previous page is not legal in XML, so (to & from) is approximated here by the looser (to | from)+
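
The two memo content models are close but not equivalent. The sketch below is not a DTD validator: it hand-translates each content model into a regular expression over the sequence of child element names (an assumption made purely for illustration) and shows a memo that the XML model accepts but the SGML model rejects.

```python
import re
import xml.etree.ElementTree as ET

# Each child name is written followed by one space, e.g. "to from body ".
# XML model:  ((to | from)+, body, close?)
XML_MEMO = re.compile(r"(?:(?:to|from) )+body (?:close )?")
# SGML model: ((to & from), body, close?) means both, once each, either order.
SGML_MEMO = re.compile(r"(?:to from |from to )body (?:close )?")


def child_names(xml_text):
    """Serialize the root's direct children as space-terminated names."""
    return "".join(child.tag + " " for child in ET.fromstring(xml_text))


def matches(model, xml_text):
    """True when the child sequence satisfies the hand-translated model."""
    return model.fullmatch(child_names(xml_text)) is not None


one_each = "<memo><from>A</from><to>B</to><body><p>Hi</p></body></memo>"
two_recipients = "<memo><to>B</to><to>C</to><body/></memo>"

print(matches(XML_MEMO, one_each), matches(SGML_MEMO, one_each))
print(matches(XML_MEMO, two_recipients), matches(SGML_MEMO, two_recipients))
```

A memo with two `to` elements and no `from` passes the XML model, which is the price of giving up the `&` connector.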


Document Type Definitions

• Declared content can be: PCDATA, CDATA, RCDATA, EMPTY
• Inclusion and Exclusion lists can be used to indicate elements that can occur or are forbidden to occur in any sub-elements of the content model. (NOT in XML) E.g.:
  – <!ELEMENT memo - - ((to & from), body, close?) +(fn)>
  – says that element fn can appear anyplace in the memo.


Document Type Definitions

• Attribute Declarations define attributes associated with (potentially) each element of a document and provide the acceptable values for those attributes.


Attributes Example

• <!ATTLIST associate_element attribute_name declared_value default_value >
• <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
  – In markup of a document:
    <memo status="CONFIDENTIAL">
    also, because of the default set:
    <memo>
    would be the same as
    <memo status="PUBLIC">
• There are a variety of special defaults and data types that can be given in attribute definitions
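
The default-value behaviour can be observed with Python's standard `xml.etree.ElementTree` parser, which applies attribute defaults declared in an internal DTD subset. This is a sketch under two assumptions: the `memo_status` helper is a name invented here, and the default is quoted because XML (unlike the SGML form on the slide) requires quoted default values.

```python
import xml.etree.ElementTree as ET

# XML form of the slide's declaration; note the quoted default value.
DTD = """<!DOCTYPE memo [
  <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) "PUBLIC">
]>
"""


def memo_status(memo_markup):
    """Parse one memo element under the DTD above and report its status."""
    return ET.fromstring(DTD + memo_markup).get("status")


default_status = memo_status("<memo/>")
explicit_status = memo_status('<memo status="CONFIDENTIAL"/>')
print(default_status, explicit_status)
```

The parser used here is non-validating, so it fills in the default but would not reject a value outside the declared list.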


Sample SGML DTD

<!doctype ELIB-TEXTS [

<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->

<!ELEMENT ELIB-TEXTS o o (ELIB-BIB*)>

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
  (SERIES | TYPE | REVISION | REVISION-DATE |
   AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
   AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
   CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP |
   LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
  (TEXT-REF | PAGED-REF)* )>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT ID - o (#PCDATA)>
<!ELEMENT ABSTRACT - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)>
… etc…
]>


XML version

<!DOCTYPE ELIB-TEXTS [

<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->

<!ELEMENT ELIB-TEXTS (ELIB-BIB*)>

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
  (SERIES | TYPE | REVISION | REVISION-DATE |
   AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
   AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
   CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP |
   LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
  (TEXT-REF | PAGED-REF)* )>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT ID (#PCDATA)>
<!ELEMENT ABSTRACT (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)>
… etc…
]>


Document Using That DTD

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0 </BIB-VERSION>
<ID>6</ID>
<ENTRY>February 13 1995</ENTRY>
<DATE>March 1, 1993</DATE>
<TITLE>Water Conditions in California Report 2</TITLE>
<ORGANIZATION>California Department of Water Resources</ORGANIZATION>
<SERIES>120-93</SERIES>
<TYPE>bulletin</TYPE>
<AUTHOR-INSTITUTIONAL>California Department of Water Resources </AUTHOR-INSTITUTIONAL>
<PAGES>17</PAGES>
<TEXT-REF>/elib/data/disk/disk5/documents/6/HYPEROCR/hyperocr.html </TEXT-REF>
<PAGED-REF>/elib/data/disk/disk5/documents/6/OCR-ASCII-NOZONE </PAGED-REF>
</ELIB-BIB>
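
Because the DTD lets most fields repeat, a program reading such a record may prefer to keep a list of values per element name. The following is a sketch only: the record is an abbreviated copy of the one above, and `record_fields` is a hypothetical helper built on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated copy of the record above, for illustration only.
RECORD = """<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0 </BIB-VERSION>
<ID>6</ID>
<TITLE>Water Conditions in California Report 2</TITLE>
<TYPE>bulletin</TYPE>
<PAGES>17</PAGES>
</ELIB-BIB>"""


def record_fields(xml_text):
    """Group the record's child elements into lists keyed by element name."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        # Strip the stray padding that some fields carry before the end-tag.
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


fields = record_fields(RECORD)
print(fields["TITLE"], fields["PAGES"])
```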


A More Complex DTD

<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                 id CDATA #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records -->

<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>

<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
…etc…


More complex DTD (cont.)

<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)>

<!-- Component Parts of Variable Data Fields -->
<!-- Numbers & Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017*, Fld018?,
  Fld019*, Fld020*, Fld022*, Fld023*, Fld024*, Fld025*, Fld027*,
  Fld028*, Fld029*, Fld030*, Fld032*, Fld033*, Fld034*, Fld035*,
  Fld036?, Fld037*, Fld039*, Fld040?, Fld041?, Fld042?, Fld043?,
  Fld044?, Fld045?, Fld046?, Fld047?, Fld048*, Fld050*, Fld051*,
  Fld052*, Fld055*, Fld060*, Fld061*, Fld066?, Fld069*, Fld070*,
  Fld071*, Fld072*, Fld074*, Fld080?, Fld082*, Fld084*, Fld086*,
  Fld088*, Fld090*, Fld096*)>

<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>

<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211*, Fld212*, Fld214*, Fld222*,
  Fld240?, Fld242*, Fld243?, Fld245, Fld246*, Fld247*)>

<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255*, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>

<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300*, Fld305*, Fld306?, Fld310?, Fld315?,
  Fld321*, Fld340*, Fld350?, Fld351*, Fld355*, Fld357*, Fld362*)>

…etc…


Complex DTD (cont.)

<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s)+)>
<!ATTLIST Fld245 AddEnty (No|Yes|Blank) #IMPLIED
                 NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>

…etc…

<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>


Document Markup

• All document markup is derived from the DTD for the particular document type.

• The DTD must be referenced in the document using the DOCTYPE declaration:
– <!DOCTYPE name SYSTEM "file_path">
or
– <!DOCTYPE name SYSTEM "file_path" [doctype_declaration_subset]>
or
– <!DOCTYPE name [doctype_declaration_subset]>

• The doctype_declaration_subset can be any combination of element, entity, and attribute declarations.
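As a small illustration of the first and third forms, the Python sketch below parses two tiny documents with the standard-library ElementTree parser. The document type name memo, the file name memo.dtd, and the entity sims are all hypothetical. ElementTree is a non-validating parser: it does not fetch or apply the external DTD, but it does expand entities declared in the internal declaration subset, which is enough to show what the subset contributes.

```python
import xml.etree.ElementTree as ET

# Form 1: the DTD lives in an external file named by the SYSTEM identifier.
# "memo" and "memo.dtd" are hypothetical; the file is never opened here.
EXTERNAL_ONLY = '<!DOCTYPE memo SYSTEM "memo.dtd"><memo>Budget meeting</memo>'

# Form 3: all declarations are given inline in the declaration subset.
# The subset mixes an element declaration with an entity declaration.
INTERNAL_ONLY = (
    '<!DOCTYPE memo ['
    '<!ELEMENT memo (#PCDATA)>'
    '<!ENTITY sims "School of Information Management and Systems">'
    ']>'
    '<memo>From &sims;</memo>'
)


def root_name_and_text(document):
    """Parse a document and return its root element name and text content."""
    root = ET.fromstring(document)
    return root.tag, root.text


print(root_name_and_text(EXTERNAL_ONLY))
print(root_name_and_text(INTERNAL_ONLY))
```

In both documents the name after DOCTYPE matches the root element, and in the second the entity reference &sims; is replaced by the text declared in the subset.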

HTML

• HTML was not originally "real" SGML: the DTD was written after the language was already in use.

• It is often more concerned with how the output looks on the screen than with the structure of the document's content.

• It relies on the application (a browser such as Netscape) to implement interesting actions like hypertext linking.

How can you describe an information-bearing object?

Dublin Core

• Review…

• Simple metadata for describing internet resources.

• For “Document-Like Objects”

• 15 Elements.

Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management
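Several of these labels differ from the element names used in the DCMES 1.1 DTD later in these slides (for example, "Other Contributors" is dc:contributor and "Rights Management" is dc:rights). A small Python lookup table, offered only as a convenience sketch, pairs each label with its element name; the function name element_for is hypothetical.

```python
# Slide label -> DCMES 1.1 element name, in the order listed above.
DC_ELEMENTS = {
    "Title": "dc:title",
    "Creator": "dc:creator",
    "Subject": "dc:subject",
    "Description": "dc:description",
    "Publisher": "dc:publisher",
    "Other Contributors": "dc:contributor",
    "Date": "dc:date",
    "Resource Type": "dc:type",
    "Format": "dc:format",
    "Resource Identifier": "dc:identifier",
    "Source": "dc:source",
    "Language": "dc:language",
    "Relation": "dc:relation",
    "Coverage": "dc:coverage",
    "Rights Management": "dc:rights",
}


def element_for(label):
    """Look up the element name for a slide label, ignoring case."""
    by_lower = {key.lower(): value for key, value in DC_ELEMENTS.items()}
    return by_lower[label.strip().lower()]


print(len(DC_ELEMENTS))
print(element_for("other contributors"))
```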

DC DTD Implementation

• There have been various versions

• This is the version recommended (required) by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

• Uses XML Namespaces

• Available at http://dublincore.org/documents/2001/09/20/dcmes-xml/

DC Element and Attribute Definitions…

<!-- The elements from DCMES 1.1 -->

<!-- The name given to the resource. -->
<!ELEMENT dc:title (#PCDATA)>
<!ATTLIST dc:title xml:lang CDATA #IMPLIED>

<!-- An entity primarily responsible for making the content of the resource. -->
<!ELEMENT dc:creator (#PCDATA)>
<!ATTLIST dc:creator xml:lang CDATA #IMPLIED>

<!-- The topic of the content of the resource. -->
<!ELEMENT dc:subject (#PCDATA)>
<!ATTLIST dc:subject xml:lang CDATA #IMPLIED>

<!-- An account of the content of the resource. -->
<!ELEMENT dc:description (#PCDATA)>
<!ATTLIST dc:description xml:lang CDATA #IMPLIED>

<!-- The entity responsible for making the resource available. -->
<!ELEMENT dc:publisher (#PCDATA)>
<!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>

<!-- An entity responsible for making contributions to the content of the resource. -->
<!ELEMENT dc:contributor (#PCDATA)>
<!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>

<!-- A date associated with an event in the life cycle of the resource. -->
<!ELEMENT dc:date (#PCDATA)>
<!ATTLIST dc:date xml:lang CDATA #IMPLIED>

DC Element Definitions, Cont.

<!-- The nature or genre of the content of the resource. -->
<!ELEMENT dc:type (#PCDATA)>
<!ATTLIST dc:type xml:lang CDATA #IMPLIED>

<!-- The physical or digital manifestation of the resource. -->
<!ELEMENT dc:format (#PCDATA)>
<!ATTLIST dc:format xml:lang CDATA #IMPLIED>

<!-- An unambiguous reference to the resource within a given context. -->
<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier xml:lang CDATA #IMPLIED>
<!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>

<!-- A Reference to a resource from which the present resource is derived. -->
<!ELEMENT dc:source (#PCDATA)>
<!ATTLIST dc:source xml:lang CDATA #IMPLIED>
<!ATTLIST dc:source rdf:resource CDATA #IMPLIED>

<!-- A language of the intellectual content of the resource. -->
<!ELEMENT dc:language (#PCDATA)>
<!ATTLIST dc:language xml:lang CDATA #IMPLIED>

<!-- A reference to a related resource. -->
<!ELEMENT dc:relation (#PCDATA)>
<!ATTLIST dc:relation xml:lang CDATA #IMPLIED>
<!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>

<!-- The extent or scope of the content of the resource. -->
<!ELEMENT dc:coverage (#PCDATA)>
<!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>

<!-- Information about rights held in and over the resource. -->
<!ELEMENT dc:rights (#PCDATA)>
<!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
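A record built from these declarations might look like the one embedded in the Python sketch below, which describes this lecture itself using values taken from the title slide. Some parts are assumptions: the rdf:RDF and rdf:Description wrapper elements follow the RDF/XML style that the DCMES XML encoding is understood to use, the rdf:about URI is a made-up example.org address, and the function name dc_values is hypothetical. The two namespace URIs are the standard ones for RDF and for Dublin Core 1.1 elements. The parser used here does not validate against the DTD; the sketch only shows how the namespaced elements are written and read.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI, used when searching the parsed record.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The rdf:about value is a hypothetical identifier for this lecture.
RECORD = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/sims202/metadata-lecture">
    <dc:title xml:lang="en">Information Structures and Metadata</dc:title>
    <dc:creator>Ray R. Larson</dc:creator>
    <dc:creator>Robert Glushko</dc:creator>
    <dc:publisher>University of California, Berkeley</dc:publisher>
    <dc:date>2001-11-15</dc:date>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
"""


def dc_values(record_xml, element):
    """Return the text of every dc:<element> in the record, in order."""
    root = ET.fromstring(record_xml.strip())
    return [node.text for node in root.iterfind(".//dc:" + element, NS)]


print(dc_values(RECORD, "title"))
print(dc_values(RECORD, "creator"))
print(dc_values(RECORD, "rights"))
```

Every element is optional and repeatable in this usage, which is why the record carries two dc:creator elements and the lookup for dc:rights simply returns an empty list.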

SGML and XML Sources and Resources

• Books:
– van Herwijnen, Eric. Practical SGML. (2nd Ed.) Boston: Kluwer Academic Publishers, 1994.
– Goldfarb, Charles F. The SGML Handbook. Oxford: Clarendon Press, 1990.
– (And MANY XML books)

• Web Sites:
– Robin Cover’s SGML/XML Site: http://www.oasis-open.org/cover/sgml-xml.html