10/23/2001 Information Organization and Retrieval: Information Structures and Metadata. University of California, Berkeley, School of Information Management

10/23/2001 Information Organization and Retrieval

Information Structures and Metadata

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Slides by Ray R. Larson and Marti Hearst

Review

• The Course

• Information Hierarchy

• Volume of information and growth of the Internet

Information Organization and Retrieval

• To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.

• Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.

• To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.

• Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.

The Oxford English Dictionary

Two Main Themes

Information Organization and Design

Information Retrieval and the Search Process

Course Schedule

• Organization
– Overview
– Metadata and Markup
– Categories, Controlled Vocabularies, Classification, Thesauri
– Information Design
    • Thesaurus Design
    • Information Architecture
    • Database Design

• Retrieval
– The Search Process
– Content Analysis
    • Tokenization, Zipf’s Law, Lexical Associations
– IR Implementation
– Term weighting and document ranking
    • Vector space model
    • Probabilistic model
– User Interfaces
    • Overviews, query specification, providing context, relevance feedback

Information Hierarchy

Wisdom

Knowledge

Information

Data

Totals Stored Per Year

Medium     Type of content     Terabytes/Year     Terabytes/Year
                               (Upper Bound)      (Lower Bound)
Paper      Books                           8                  1
           Newspapers                     25                  2
           Periodicals                    12                  1
           Office documents              195                 19
           SUBTOTAL                      240                 23
Film       Photographs               410,000             41,000
           Cinema                         16                 16
           X-Rays                     17,200             17,200
           SUBTOTAL                  427,216             58,216
Optical    Music CDs                      58                  6
           Data CDs                        3                  3
           DVDs                           22                 22
           SUBTOTAL                       83                 31
Magnetic   Camcorder                 300,000            300,000
           Disk drives             1,393,000            277,210
           SUBTOTAL                1,693,000            577,210
TOTAL                              2,120,539            635,480

Projected Voice and Data Traffic

[Figure: projected voice and data traffic in Gb/s by year, 1996 to 2002; vertical axis from 0 to 30,000 Gb/s; two series, Voice and Data]

Source: America's Network, May 15, 1998

Internet Hosts (000s) 1989-2006

[Figure: number of Internet hosts (in thousands) by year; vertical axis from 0 to 1,000,000; horizontal axis labeled 1989 through 2005]

Source: Vint Cerf

Information Overload

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

Today

• Organization of Information

• Information Life Cycle (review)

• Introduction to structured information (SGML/XML)

• Metadata and the Dublin Core

Organization of Information

• Is there a basic human need to put things into some sort of order?
– Much of natural language concerns categories of things rather than individual things (more on this next week)
– Why do we organize things and information?
    • Why do spoons go in THAT drawer in the kitchen and not in a can in the garage?
    • Why do your favorite books go on one shelf and not-so-favorite on another?

Why Organize Information?

• The main reason
– So that you can find things more effectively

• I.e., Effective retrieval is predicated on some sort of organization applied to information resources

• Historically there have been many institutions and tools devoted to information organization
– Libraries
– Museums
– Archives
– Indexes and catalogs, dictionaries, Phone books, etc.

Why Organize Information?

• A question of scale:
– Using your own ad hoc set of categories and methods to organize your own collection of books seems to work fine…
– What if your collection grew to
    • 10 times the size? How would you organize it?
    • 100 times?
    • 1000 times?
    • 100000 times?

What is Information Organization?

• Identifying the existence of all types of information-bearing entities as they are made available

• Identifying the works contained within those information-bearing entities or as parts of them

• Systematically pulling together these information-bearing entities into collections in libraries, archives, museums, Internet communications files and other such depositories.

From Hagler via Taylor, Chap. 1

What is Information Organization?

• Producing lists of these information-bearing entities prepared according to standard rules for citation

• Providing name, title, subject and other useful access to these information-bearing entities

• Providing the means of locating each information-bearing entity or a copy of it

Organizing Information

• Libraries

• Archives

• Museums and Galleries

• Internet

• Corporate and Office environments

Information Life Cycle

[Figure: information life cycle diagram. Stages: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Utilization, Searching, Active, Semi-Active, Inactive, Retention/Mining, Disposition, Discard.]

Authoring/Modifying

• Converting Data+Information+Knowledge to New Information.

• Creating information from observation, thought.

• Editing and Publication.

• Gatekeeping

Organizing/Indexing

• Collecting and Integrating information.

• Affects Data, Information and Metadata.

• “Metadata” describes data and information.
– More on this later.

• Organizing Information.
– Types of organization?

• Indexing

Storing/Retrieving

• Information Storage
– How and where is information stored?

• Retrieving Information.
– How is information recovered from storage
– How to find needed information
– Linked with Accessing/Filtering stage

Distribution/Networking

• Transmission of information
– How is information transmitted?

• Networks vs Broadcast.

Accessing/Filtering

• Using the organization created in the O/I stage to:
– Select desired (or relevant) information
– Locate that information
– Retrieve the information from its storage location (often via a network)

Using/Creating

• Using Information.

• Transformation of Information to Knowledge.

• Knowledge to New Data and New Information.

Key issues in this course

• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them.
– Organizing

• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs.
– Retrieving

Key Issues

[Figure: the information life cycle diagram again. Stages: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Utilization, Searching, Active, Semi-Active, Inactive, Retention/Mining, Disposition, Discard.]

Structure of an IR System

[Figure: structure of an information storage and retrieval system. Labels: Interest profiles & Queries; Documents & data; Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language); Formulating query in terms of descriptors; Indexing (Descriptive and Subject); Storage of profiles; Storage of Documents; Storage Line; Store1: Profiles/Search requests; Store2: Document representations; Comparison/Matching; Potentially Relevant Documents.]
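
The flow named in this diagram (index documents into descriptors, formulate the query in the same descriptors, then compare the two stores) can be sketched in a few lines of Python. This is only a toy illustration: the lead-in vocabulary, the document ids, and the function names `to_descriptors`, `index_documents`, and `match` are all assumptions made for the example, and real systems rank results rather than simply testing for overlap.

```python
# Hypothetical lead-in vocabulary: entry terms mapped to preferred descriptors
# (the "indexing language"). All terms here are invented for illustration.
LEAD_IN = {
    "cars": "automobiles",
    "autos": "automobiles",
    "automobiles": "automobiles",
    "films": "motion pictures",
    "movies": "motion pictures",
    "motion pictures": "motion pictures",
}


def to_descriptors(terms):
    """Translate free terms into descriptors; unknown terms are dropped."""
    return {LEAD_IN[t.lower()] for t in terms if t.lower() in LEAD_IN}


def index_documents(documents):
    """Build the store of document representations: doc id -> descriptor set."""
    return {doc_id: to_descriptors(terms) for doc_id, terms in documents.items()}


def match(query_terms, document_store):
    """Return ids of documents sharing at least one descriptor with the query."""
    query = to_descriptors(query_terms)
    return sorted(
        doc_id
        for doc_id, descriptors in document_store.items()
        if query & descriptors
    )


# Hypothetical documents, each described by a few free-text terms.
documents = {
    "doc1": ["cars", "movies"],
    "doc2": ["films"],
    "doc3": ["autos"],
}
store = index_documents(documents)
print(match(["automobiles"], store))
```

Because both documents and queries pass through the same vocabulary, a query for "automobiles" can find documents that were described with "cars" or "autos".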

Metadata

• Metadata is:
– “data about data” (database systems)
– Information about Information
– Structures and Languages for the Description of Information Resources and their elements (components or features)
– “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)

Types of Metadata

• Element names.

• Element description.

• Element representation.

• Element coding.

• Element semantics.

• Element classification.

How can you describe an information-bearing object?

Dublin Core

• Simple metadata for describing internet resources.

• For “Document-Like Objects”

• 15 Elements (in base DC)

Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management
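
As a rough illustration of how the fifteen elements might be carried in software, the sketch below holds a record as a plain Python dictionary keyed by element label and renders it as HTML `meta` tags. The `DC.` name prefix, the lowercase label spelling in the tags, the helper name `to_meta_tags`, and the sample values are assumptions made for this example and are not taken from the slides.

```python
from html import escape

# The fifteen element labels, using the label names given in these slides.
DC_ELEMENTS = (
    "TITLE", "CREATOR", "SUBJECT", "DESCRIPTION", "PUBLISHER",
    "CONTRIBUTORS", "DATE", "TYPE", "FORMAT", "IDENTIFIER",
    "SOURCE", "LANGUAGE", "RELATION", "COVERAGE", "RIGHTS",
)


def to_meta_tags(record):
    """Render a record (dict of label -> value) as HTML meta tags.

    Elements missing from the record are skipped; unknown labels raise.
    """
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    lines = []
    for label in DC_ELEMENTS:
        if label in record:
            value = escape(str(record[label]), quote=True)
            lines.append(f'<meta name="DC.{label.lower()}" content="{value}">')
    return "\n".join(lines)


# Hypothetical sample record, filled in only for illustration.
record = {
    "TITLE": "Information Structures and Metadata",
    "CREATOR": "Ray R. Larson",
    "DATE": "20011023",
    "FORMAT": "text/html",
    "LANGUAGE": "eng",
}
print(to_meta_tags(record))
```

Every element is optional in this sketch, so a record may carry any subset of the fifteen labels.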

Title

• Label: TITLE

• The name given to the resource by the CREATOR or PUBLISHER.

Author or Creator

• Label: CREATOR

• The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents; artists, photographers, or illustrators in the case of visual resources.

Subject and Keywords

• Label: SUBJECT

• The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

Description

• Label: DESCRIPTION

• A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

Publisher

• Label: PUBLISHER

• The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

Other Contributors

• Label: CONTRIBUTORS

• Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).

Date

• Label: DATE

• The date the resource was made available in its present form. The recommended best practice is an 8-digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schemes are possible, but if used, they should be identified in an unambiguous manner.
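
The YYYYMMDD form is easy to produce and check with Python's standard `datetime` module. The following is a minimal sketch; the helper names `to_dc_date` and `from_dc_date` are made up for this example.

```python
from datetime import date, datetime


def to_dc_date(d):
    """Format a date as the 8-digit YYYYMMDD string."""
    return d.strftime("%Y%m%d")


def from_dc_date(text):
    """Parse an 8-digit YYYYMMDD string; raises ValueError if malformed."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected 8 digits, got {text!r}")
    return datetime.strptime(text, "%Y%m%d").date()


# The date used as the example on the slide: December 3, 1996.
print(to_dc_date(date(1996, 12, 3)))
print(from_dc_date("19961203"))
```

Parsing rejects anything that is not exactly eight digits, and `strptime` additionally rejects impossible dates such as month 13.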

Resource Type

• Label: TYPE

• The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at the following URL: http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html

Format

• Label: FORMAT

• The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.
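
For electronic resources, a FORMAT value drawn from the registered MIME types can often be guessed from a file name. The sketch below uses Python's standard `mimetypes` module; the helper name `guess_format` and the file names are hypothetical, and the guess depends only on the file extension, not on the file's contents.

```python
import mimetypes


def guess_format(filename):
    """Return a MIME type for the FORMAT element, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Hypothetical file names, used only to show the lookup.
for name in ("index.html", "photo.jpeg", "notes.txt"):
    print(name, "->", guess_format(name))
```

A value of `None` signals that the extension is not recognized, in which case the FORMAT element would need to be assigned by hand.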

Resource Identifier

• Label: IDENTIFIER

• String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.
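
To make the kinds of identifier concrete, here is a small sketch that sorts an IDENTIFIER value into URL, URN, ISBN, or other. It is an illustration under stated assumptions: the function names are invented, only the ten-digit ISBN form is checked (thirteen-digit ISBNs use a different check digit and are not handled), and only a few URL schemes are recognized.

```python
from urllib.parse import urlparse


def is_valid_isbn10(text):
    """Check the ISBN-10 check digit (weights 10..1, sum divisible by 11)."""
    chars = text.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            digit = 10  # 'X' stands for 10 and is allowed only in the last place
        elif ch in "0123456789":
            digit = int(ch)
        else:
            return False
        total += (10 - position) * digit
    return total % 11 == 0


def identifier_kind(value):
    """Classify an IDENTIFIER value as 'URN', 'URL', 'ISBN', or 'other'."""
    scheme = urlparse(value).scheme.lower()
    if scheme == "urn":
        return "URN"
    if scheme in ("http", "https", "ftp"):
        return "URL"
    if is_valid_isbn10(value):
        return "ISBN"
    return "other"


print(identifier_kind("http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html"))
print(identifier_kind("0-306-40615-2"))
```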

Source

• Label: SOURCE

• The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

Language

• Label: LANGUAGE

• Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html

Relation

• Label: RELATION

• Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others but exist as discrete resources themselves, for example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should currently be considered experimental.

Coverage

• Label: COVERAGE

• The spatial locations and temporal duration characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should currently be considered experimental.

Rights Management

• Label: RIGHTS

• The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.