
Content Types: Text and Metadata

Introduction

Text documents come in many forms
– Article (news, conference, journal, etc.)
– Email, memo, …
– Book, manual, manuscript, transcript, …
– Any part of one of the above

Syntax can express
– Structure
– Presentation style
– Semantics (e.g. software code)

Metadata

Metadata: data about data

Descriptive metadata
– External to the meaning of the document
– Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.

Semantic metadata
– Characterizes the semantic content of the document
– LoC subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.

Metadata Formats

Machine Readable Cataloging Record (MARC)
– Used by most libraries
– Fields include title, author, etc.

Resource Description Framework (RDF)
– Used for Web resources
– Node and attribute / value pairs
– Node ID is any Uniform Resource Identifier (URI), which could be a URL

Metadata Sets

Dublin Core Metadata Elements
1. Contributor – entities contributing to the content
2. Coverage – extent or scope of content (spatial area, temporal period, …)
3. Creator – entity primarily responsible for making the content
4. Date – date associated with an event (e.g. publication) for the resource
5. Description – abstract, table of contents, …
6. Format – media (file) type, dimensions (size, duration), hardware needed
7. Identifier – unique identifier
8. Language – language of content
9. Publisher – entity responsible for making the resource available
10. Relation – reference to related resource(s)
11. Rights – information about rights held in/over the resource
12. Source – resource from which content is derived
13. Subject – keywords, key phrases, classification code, etc.
14. Title – name of the resource
15. Type – nature or genre of content
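To make the element set concrete, here is a minimal sketch that serializes a record using Dublin Core element names as XML. The record values are invented for illustration, the wrapper element name `metadata` and the helper `to_dublin_core_xml` are assumptions, and only a few of the fifteen elements are shown. The namespace URI is the one published for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core_xml(record):
    """Serialize a dict of Dublin Core element -> value as XML text.

    The <metadata> wrapper is a hypothetical container, not part of
    Dublin Core itself.
    """
    root = ET.Element("metadata")
    for element, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; they do not describe a real resource.
record = {
    "title": "Example Lecture Notes",
    "creator": "A. Author",
    "date": "2001-01-01",
    "language": "en",
    "format": "text/plain",
}

xml_text = to_dublin_core_xml(record)
print(xml_text)
```

Each element becomes a `dc:`-prefixed child, so a consumer that knows the namespace can read the metadata without knowing anything else about the document.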

Text Formats

Coding schemes
– EBCDIC (8 bit, one of the first coding schemes)
– ASCII (initially 7 bit, extended to 8 bit)
– Unicode (originally 16 bit, for large alphabets)

Additional formats
– RTF (format-oriented document exchange)
– PDF and PostScript (display-oriented representation)
– Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
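The practical difference between coding schemes shows up in how many bytes a string needs and whether it can be encoded at all. The sketch below uses Python's built-in codecs. `cp037` is one EBCDIC code page among several and is chosen here as an assumption, and `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(text):
    """Return the byte length of `text` under several coding schemes.

    A value of None means the text contains a symbol that the scheme
    cannot represent.
    """
    sizes = {}
    for codec in ("ascii", "cp037", "latin-1", "utf-8", "utf-16-be"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


# The same letter has different byte values in ASCII and EBCDIC.
print("A".encode("ascii"), "A".encode("cp037"))

# A word with one non-ASCII letter (i with diaeresis, U+00EF).
print(encoded_sizes("na\u00efve"))
```

Plain 7-bit ASCII cannot encode the accented letter, while the Unicode encodings can, at the cost of more bytes per character.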

Information Theory

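How can we predict the information value of the components of a document?

Entropy attempts to model information content (information uncertainty)

$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$

where the sum runs over all $\sigma$ symbols in the alphabet and $p_i$ is the probability of symbol $i$ (its frequency divided by the total number of symbol occurrences in the text)

Computing the $p_i$ requires a text model for real language

Also important for compression: under a given text model, E is a lower bound on the average number of bits per symbol, so it limits how much a text can be compressed.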
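As a rough illustration of the entropy formula, the sketch below estimates each symbol probability from the symbol's relative frequency in the text itself, which amounts to assuming a 0-order model. The function name `entropy` is an illustrative choice.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy of `text` in bits per symbol, using relative frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


for sample in ("aaaa", "abab", "abcd"):
    print(sample, entropy(sample))
```

A text with a single repeated symbol carries no uncertainty, while four equally likely symbols need two bits each, which matches the intuition that entropy bounds the achievable compression.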

Modeling Character Strings

Symbols in natural language (NL) are not evenly distributed
– Some symbols are not part of words (often used for syntax)
– Symbols in words are not evenly distributed

Models
– Binomial model uses the distribution of symbols in the language
  • But previous symbols influence the probabilities of later symbols
  • (What letter will appear after a q?)
– Finite-context or Markovian models are used for this dependency
  • k-order, where k is the number of previous characters taken into account by the model
  • Thus, the binomial model is a 0-order model
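The idea of a k-order model can be sketched by counting which symbol follows each context of k characters. This is a minimal illustration trained on a tiny invented sample, and the names `build_model` and `next_symbol_probs` are assumptions. With k = 0 the context is always the empty string, which reduces to plain symbol frequencies.

```python
from collections import Counter, defaultdict


def build_model(text, k):
    """Count, for every k-character context, the symbols that follow it."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]
        model[context][text[i]] += 1
    return model


def next_symbol_probs(model, context):
    """Estimated probability of each symbol following `context`."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {symbol: c / total for symbol, c in counts.items()}


sample = "quick quiet queen quota"  # toy training text
order1 = build_model(sample, 1)
order0 = build_model(sample, 0)

print(next_symbol_probs(order1, "q"))       # what follows a q?
print(next_symbol_probs(order0, "")["q"])   # frequency of q overall
```

In this sample the 1-order model is certain about the letter after q, whereas the 0-order model only knows how often each letter occurs.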

Word Distribution in Documents

How frequent are words within documents?

Zipf’s Law
– Frequency of the i-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word
– The value of $\theta$ depends on the text (a value of 1 gives a logarithmic distribution)
– $\theta$ values of 1.5 to 2.0 best model real texts

In practice, a few hundred words make up 50% of most texts
– Frequent words provide less information
– Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
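The rank-frequency relationship and the share of text taken up by the most frequent words can be checked on any token list. The sketch below is illustrative only: the helper names are assumptions, the sample sentence is invented, and a real test would need a much larger text.

```python
from collections import Counter


def rank_frequencies(words):
    """Words with their counts, most frequent first."""
    return Counter(words).most_common()


def zipf_prediction(f1, rank, theta):
    """Frequency predicted for the word at `rank`, given the top frequency f1."""
    return f1 / rank ** theta


def coverage(words, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent words."""
    ranked = rank_frequencies(words)
    return sum(count for _, count in ranked[:top_n]) / len(words)


words = "the cat and the dog and the bird".split()
print(rank_frequencies(words))
print(zipf_prediction(1000, 2, 1.0))
print(coverage(words, 2))
```

Running `coverage` with a stopword-sized `top_n` on a large corpus is one way to see why dropping a short stopword list removes such a large share of the tokens.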

Word Distribution in Collections

Simplest to assume a uniform distribution of words in documents
– But not true

Better models built on negative binomial distributions or Poisson distributions

Vocabulary Size for Documents and Collections

Heaps’ Law
– Vocabulary size (V) grows with the number of words (n)
  • $V = K n^{b}$
  • Experimentally,
    – K is between 10 and 100
    – b is between 0.4 and 0.6
– So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
– Works best for large documents & collections
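A short sketch of the vocabulary-growth law follows. The default values K = 50 and b = 0.5 are arbitrary picks from inside the experimental ranges quoted above, not fitted constants, and both function names are illustrative. `vocabulary_growth` measures the real curve for a token list so it can be compared against the prediction.

```python
def heaps_vocabulary(n, K=50.0, b=0.5):
    """Vocabulary size predicted for a text of n words: V = K * n**b."""
    return K * n ** b


def vocabulary_growth(words):
    """Number of distinct words seen after each token of `words`."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes


print(heaps_vocabulary(1_000_000))
print(vocabulary_growth("a b a c b d".split()))
```

With b = 0.5, multiplying the text length by 100 multiplies the predicted vocabulary by only 10, which is the square-root growth described above.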

String Similarity Models

Similarity is measured by a distance function

Hamming distance: number of positions at which two strings of equal length differ

Levenshtein distance: minimum number of insertions, deletions, and substitutions needed to make strings equal
– color to colour is 1
– survey to surgery is 2

Can be extended to documents
– UNIX diff treats each line as a character
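Both distances can be sketched in a few lines. The Levenshtein function below uses the standard dynamic-programming formulation and accepts any sequences, so passing lists of lines instead of strings gives a line-level distance in the spirit of diff (this is an illustration of the idea, not how the diff utility is implemented).

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
print(levenshtein(["line 1", "line 2", "line 3"], ["line 1", "line 3"]))
```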

Content Types: Markup and Multimedia

Introduction

Markup languages use extra textual syntax to encode:
– Formatting / display information
– Structure information
– Descriptive metadata
– Semantic metadata

Marks are often called tags
– The act of adding markup is called tagging
– Most markup languages use initial and ending tags surrounding the marked text

Standard Generalized Markup Language (SGML)

Metalanguage for markup
– Includes rules for defining a markup language
– Use of SGML includes
  • Description of the structure of the markup
  • Text marked with tags

Document Type Definition (DTD)
– Describes and names tags and how they are related
– Comments used to express the interpretation of tags (meaning, presentation, …)

SGML DTD Example

<!-- SGML DTD for electronic messages -->
<!ELEMENT e-mail - - (prolog, contents) >
<!ELEMENT prolog - - (sender, address+, subject?, Cc*) >
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
<!ELEMENT contents - - (par | image | audio)+ >
<!ELEMENT par - O (ref | #PCDATA)+ >
<!ELEMENT ref - O EMPTY >
<!ELEMENT (image | audio) - - (#NDATA) >
<!ATTLIST e-mail
    id        ID                #REQUIRED
    date_sent DATE              #REQUIRED
    status    (secret | public) public >
<!ATTLIST ref
    id        IDREF             #REQUIRED >
<!ATTLIST (image | audio)
    id        IDREF             #REQUIRED >

SGML Example

<!DOCTYPE e-mail SYSTEM "e-mail.dtd">
<e-mail id=94108rby date_sent=02101998>
<prolog>
<sender> Pablo Neruda</sender>
<address> Federico Garcia Lorca</address>
<address> Ernest Hemingway</address>
<subject> Picture of my house in Isla
<Cc> Gabriel Garcia Marquez</Cc>
</prolog>
<contents>
<par>Here are two photos. One is of the view (photo <ref idref=F2>).</par>
<image id=F1> "photo1.gif" </image>
<image id=F2> "photo2.jpg" </image>
</contents>
</e-mail>

SGML Characteristics

A DTD makes it possible to determine whether a given document is valid, i.e. conforms to the declared structure.

SGML generally does not specify presentation/appearance.

Output specification standards:
– DSSSL (Document Style Semantics and Specification Language)
– FOSI (Formatting Output Specification Instance)

HyperText Markup Language (HTML)

Based on SGML
– HTML DTD not explicitly referenced by documents

HTML documents can have documents embedded within them
– Images or audio
– Frames with other HTML documents

When programs are included, it is referred to as Dynamic HTML

Strict HTML includes only non-presentational markup
– Cascading Style Sheets (CSS) used to define presentation

In reality, presentational and structural markup are blended by HTML authoring applications.

(Original) HTML Limitations

In contrast to SGML:
– Users cannot specify their own tags or attributes.
– No support for nested structures that can represent database schemas or object-oriented hierarchies.
– No support for validation of document by consuming applications.

eXtensible Markup Language (XML)

XML is a simplified subset of SGML
– XML is a meta-language
– XML designed for semantic markup that is both human and machine readable
– No DTD is required
– All tags must be closed

Extensible Stylesheet Language (XSL)
– XML equivalent of CSS
– Can be used to convert XML into HTML and CSS
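Because XML needs no DTD and requires every tag to be closed, a generic parser can check a document without any schema. The sketch below uses Python's standard `xml.etree.ElementTree`. The two sample documents are invented, loosely modelled on an e-mail message, and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML, i.e. tags are properly nested and closed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


closed = "<e-mail><prolog><subject>Photos</subject></prolog></e-mail>"
unclosed = "<e-mail><prolog><subject>Photos</prolog></e-mail>"

print(is_well_formed(closed))
print(is_well_formed(unclosed))

# Once parsed, the semantic markup can be queried by tag name.
print(ET.fromstring(closed).find("prolog/subject").text)
```

The second document omits an end tag, which SGML can permit through tag minimization but XML rejects.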

Multimedia

Lots of data file formats for non-textual data
– Images
  • BMP, GIF, JPEG (JPG), TIFF
– Audio
  • AU, MIDI, WAVE, MP3
– Video
  • MPEG, AVI, QuickTime
– Graphics / Virtual Environments
  • CGM, VRML, OpenGL

Audio and Video

Data files often have:
– Header
  • Indicates time granularity, number of channels, bits per channel
  • Somewhat like a DTD
– Data
  • The signal

Data may be compressed
– Data may be in the frequency domain rather than the time domain
– Data may be encoded as a sequence of differences between consecutive time segments.
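The header/data split and the difference encoding can both be sketched with the standard library. The first part writes a silent WAVE file into memory and reads its header back with Python's `wave` module. The second part shows difference (delta) encoding on a short list of invented sample values. All function names and the chosen parameters (one channel, 16 bits, 8000 samples per second) are assumptions for illustration.

```python
import io
import wave
from itertools import accumulate


def make_wav(n_frames=8000, channels=1, sample_width=2, rate=8000):
    """Build an in-memory WAVE file containing silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample
        w.setframerate(rate)           # samples per second
        w.writeframes(b"\x00" * n_frames * channels * sample_width)
    return buf.getvalue()


def read_header(data):
    """Extract the header fields that describe how to interpret the signal."""
    with wave.open(io.BytesIO(data), "rb") as r:
        return {
            "channels": r.getnchannels(),
            "bits_per_sample": r.getsampwidth() * 8,
            "sample_rate": r.getframerate(),
            "frames": r.getnframes(),
        }


def delta_encode(samples):
    """Keep the first sample, then store each difference from the previous one."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]


def delta_decode(deltas):
    """Rebuild the original samples by running summation."""
    return list(accumulate(deltas))


header = read_header(make_wav())
print(header)

samples = [10, 12, 11, 11]  # invented values
print(delta_encode(samples))
```

For a slowly varying signal the differences are small numbers, which is what makes them cheaper to store than the raw samples.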