Content Types: Text and Metadata
Introduction
Text documents come in many forms
– Article (news, conference, journal, etc.)
– Email, memo, …
– Book, manual, manuscript, transcript, …
– Any part of one of the above
Syntax can express
– Structure
– Presentation style
– Semantics (e.g. software code)
Metadata
Metadata – data about data
Descriptive metadata
– External to the meaning of the document
– Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.
Semantic metadata
– Characterizes the semantic content of the document
– LoC subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.
Metadata Formats
Machine Readable Cataloging Record (MARC)
– Used by most libraries
– Fields include title, author, etc.
Resource Description Framework (RDF)
– Used for Web resources
– Node and attribute / value pairs
– Node ID is any Uniform Resource Identifier (URI), which could be a URL
Metadata Sets
Dublin Core Metadata Elements
1. Contributor – entities contributing to the content
2. Coverage – extent or scope of content (spatial area, temporal period, …)
3. Creator – entity primarily responsible for making the content
4. Date – date associated with an event (e.g. publication) for the resource
5. Description – abstract, table of contents, …
6. Format – media (file) type, dimensions (size, duration), hardware needed
7. Identifier – unique identifier
8. Language – language of content
9. Publisher – entity responsible for making the resource available
10. Relation – reference to related resource(s)
11. Rights – information about rights held in/over the resource
12. Source – resource from which content is derived
13. Subject – keywords, key phrases, classification code, etc.
14. Title – name of the resource
15. Type – nature or genre of content
Text Formats
Coding schemes
– EBCDIC (8 bit, one of the first coding schemes)
– ASCII (initially 7 bit, extended to 8 bit)
– Unicode (16 bit in its original design, for large alphabets)
Additional formats
– RTF (format-oriented document exchange)
– PDF and PostScript (display-oriented representation)
– Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
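As a rough illustration of the coding schemes above, the following Python sketch encodes the same character under three schemes and prints the resulting bytes. The codec names (`cp037` for one common EBCDIC code page, `utf-16-be` for 16-bit Unicode units) are Python's standard codec names and are chosen here only as examples; the helper `encode_all` is hypothetical.

```python
# Encode the same text under three coding schemes and compare the bytes.
def encode_all(text):
    return {
        "ascii": text.encode("ascii"),           # 7-bit code stored in one byte
        "ebcdic_cp037": text.encode("cp037"),    # one EBCDIC code page (assumed example)
        "utf16": text.encode("utf-16-be"),       # 16-bit units, big-endian, no byte-order mark
    }

encoded = encode_all("A")
for name, data in encoded.items():
    print(name, data.hex())
```

The same letter gets a different numeric code in ASCII and EBCDIC, and occupies two bytes in the 16-bit encoding, which is why a document's coding scheme has to be known before its text can be interpreted.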
Information Theory
How can we predict information value of components of a document?
Entropy – attempts to model information content (information uncertainty)
$E = -\sum_{i} p_i \log_2 p_i$, where the sum runs over all symbols i in the alphabet
p_i is the probability of symbol i (its frequency in the text divided by the total number of symbols)
Need a text model for real language
Also important for compression: E is a lower bound on the average number of bits per symbol, so it limits how much a text can be compressed under the chosen model.
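A minimal sketch of the entropy formula above, assuming symbol probabilities are estimated from the observed character frequencies of the text itself. The function name `entropy` is illustrative.

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy in bits per symbol, from observed symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    # p_i = count of symbol i / total number of symbols
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for sample in ["aaaa", "abab", "abcd"]:
    print(sample, entropy(sample))
```

A text with a single repeated symbol carries no uncertainty, while four equally likely symbols need two bits each. Multiplying the entropy by the text length gives an estimate of the smallest size, in bits, that a coder using only these symbol frequencies could reach.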
Modeling Character Strings
Symbols in natural language (NL) are not evenly distributed
– Some symbols are not part of words (often used for syntax)
– Symbols in words are not evenly distributed
Models
– Binomial model uses the distribution of symbols in the language
• But previous symbols influence probabilities of later symbols
• (what letter will appear after a q?)
– Finite context or Markovian models are used for this dependency
• k-order where k is the number of previous characters taken into account by the model
• Thus, the binomial model is a 0-order model
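The k-order idea can be sketched as a table of next-symbol probabilities conditioned on the previous k characters. This is an illustrative estimate from raw counts, not a smoothed language model; the name `context_model` and the sample string are assumptions.

```python
from collections import Counter, defaultdict

def context_model(text, k):
    """Estimate P(next symbol | previous k symbols) from observed counts."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]          # empty string when k == 0
        counts[context][text[i]] += 1
    return {
        ctx: {sym: c / sum(nxt.values()) for sym, c in nxt.items()}
        for ctx, nxt in counts.items()
    }

order1 = context_model("quiet queen quit", 1)
print(order1["q"])   # distribution of the letter that follows 'q'
```

With k = 0 every symbol shares the single empty context, so the table reduces to plain symbol frequencies, which matches the 0-order case described above.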
Word Distribution in Documents
How frequent are words within documents?
Zipf's Law
– Frequency of the i-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word
– The value of theta depends on the text (with theta = 1, the normalizing sum over the vocabulary grows logarithmically)
– Theta values of 1.5 to 2.0 best model real texts
In practice, a few hundred words make up 50% of most texts
– Frequent words provide less information
– Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
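To make the rank and frequency relationship concrete, here is a small sketch that computes the frequency Zipf's Law would predict for a given rank and measures how much of a text its most frequent words cover. The function names and the toy sentence are hypothetical; real measurements need a much larger text.

```python
from collections import Counter

def zipf_expected(rank, top_frequency, theta=1.0):
    """Frequency predicted for the word at `rank` (1 = most frequent)."""
    return top_frequency / rank ** theta

def rank_frequencies(words):
    """Observed word counts, sorted from most to least frequent."""
    return [count for _, count in Counter(words).most_common()]

def coverage(words, m):
    """Fraction of all word occurrences accounted for by the top m words."""
    freqs = rank_frequencies(words)
    return sum(freqs[:m]) / sum(freqs)

sample = "the cat and the dog and the bird".split()
print(rank_frequencies(sample))
print(coverage(sample, 2))
```

Comparing `rank_frequencies` against `zipf_expected` for several values of theta is one simple way to see which exponent fits a given text.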
Word Distribution in Collections
Simplest to assume a uniform distribution of words in documents
– But this is not true
Better models are built on negative binomial or Poisson distributions
Vocabulary Size for Documents and Collections
Heaps' Law
– Vocabulary size (V) grows with the number of words (n)
• $V = K n^{b}$
• Experimentally, K is between 10 and 100 and b is between 0.4 and 0.6
– So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
– Works best for large documents & collections
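A short sketch of the vocabulary growth formula, alongside a helper that measures actual vocabulary growth in a word sequence. The default values K = 50 and b = 0.5 are assumed illustrative picks from inside the experimental ranges quoted above, not fitted constants.

```python
def heaps_vocabulary(n, K=50.0, b=0.5):
    """Vocabulary size predicted for a text of n words: V = K * n**b."""
    return K * n ** b

def vocabulary_growth(words):
    """Number of distinct words seen after each word of the sequence."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth

print(heaps_vocabulary(10_000))
print(vocabulary_growth("a b a c b d".split()))
```

Plotting `vocabulary_growth` for a real collection against `heaps_vocabulary` is a straightforward way to estimate K and b for that collection.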
String Similarity Models
Similarity is measured by a distance function
Hamming distance – number of positions at which two strings of equal length differ
Levenshtein distance – minimum number of insertions, deletions, and substitutions needed to make strings equal
– color to colour is 1
– survey to surgery is 2
Can be extended to documents
– UNIX diff treats each line as a character
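Both distances can be sketched in a few lines of Python. The Levenshtein version below uses the standard row-by-row dynamic programming formulation; it is one common way to compute the distance, not necessarily the one any particular tool uses.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))   # distance from empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
```

Because the functions only compare items for equality, they also accept lists of lines, which mirrors the idea of treating each line of a document as a single character.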
Content Types: Markup and Multimedia
Introduction
Markup languages use extra textual syntax to encode:
– Formatting / display information
– Structure information
– Descriptive metadata
– Semantic metadata
Marks are often called tags
– The act of adding markup is called tagging
– Most markup languages use initial and ending tags surrounding the marked text
Standard Generalized Markup Language (SGML)
Metalanguage for markup
– Includes rules for defining a markup language
– Use of SGML includes
• Description of the structure of the markup
• Text marked with tags
Document Type Definition (DTD)
– Describes and names tags and how they are related
– Comments are used to express the interpretation of tags (meaning, presentation, …)
SGML DTD Example
<!-- SGML DTD for electronic messages -->
<!ELEMENT e-mail - - (prolog, contents)>
<!ELEMENT prolog - - (sender, address+, subject?, Cc*)>
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA)>
<!ELEMENT contents - - (par | image | audio)+>
<!ELEMENT par - O (ref | #PCDATA)+>
<!ELEMENT ref - O EMPTY>
<!ELEMENT (image | audio) - - (#NDATA)>
<!ATTLIST e-mail
    id ID #REQUIRED
    date_sent DATE #REQUIRED
    status (secret | public) public>
<!ATTLIST ref
    id IDREF #REQUIRED>
<!ATTLIST (image | audio)
    id ID #REQUIRED>
SGML Example
<!DOCTYPE e-mail SYSTEM "e-mail.dtd">
<e-mail id=94108rby date_sent=02101998>
<prolog>
<sender> Pablo Neruda</sender>
<address> Federico Garcia Lorca</address>
<address> Ernest Hemingway</address>
<subject> Picture of my house in Isla
<Cc> Gabriel Garcia Marquez</Cc>
</prolog>
<contents>
<par>Here are two photos. One is of the view (photo <ref idref=F2>).</par>
<image id=F1> "photo1.gif" </image>
<image id=F2> "photo2.jpg" </image>
</contents>
</e-mail>
SGML Characteristics
A DTD makes it possible to check whether a given document is valid, i.e. conforms to the declared structure.
SGML generally does not specify presentation/appearance.
Output specification standards:
– DSSSL (Document Style Semantics and Specification Language)
– FOSI (Formatting Output Specification Instance)
HyperText Markup Language (HTML)
Based on SGML
– HTML DTD not explicitly referenced by documents
HTML documents can have documents embedded within them
– Images or audio
– Frames with other HTML documents
When programs are included, it is referred to as Dynamic HTML
Strict HTML includes only non-presentational markup.
– Cascading Style Sheets (CSS) used to define presentation
In reality, presentational and structural markup are blended by HTML authoring applications.
(Original) HTML Limitations
In contrast to SGML:
– Users cannot specify their own tags or attributes.
– No support for nested structures that can represent database schemas or object-oriented hierarchies.
– No support for validation of a document by consuming applications.
eXtensible Markup Language (XML)
XML is a simplified subset of SGML
– XML is a meta-language
– XML is designed for semantic markup that is both human and machine readable
– No DTD is required
– All tags must be closed
Extensible Stylesheet Language (XSL)
– XML equivalent of CSS
– Can be used to convert XML into HTML and CSS
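The requirement that all tags be closed can be checked without any DTD. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the two sample strings and the helper `is_well_formed` are hypothetical and loosely modeled on the e-mail example.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the second one leaves <subject> unclosed.
WELL_FORMED = "<e-mail id='m1'><subject>Photos</subject></e-mail>"
UNCLOSED = "<e-mail id='m1'><subject>Photos</e-mail>"

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(WELL_FORMED))
print(is_well_formed(UNCLOSED))
```

This only checks syntax. Checking a document against a declared structure is a separate validation step that needs a DTD or schema, which this parser does not perform.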
Multimedia
Lots of data file formats for non-textual data
– Images
• BMP, GIF, JPEG (JPG), TIFF
– Audio
• AU, MIDI, WAVE, MP3
– Video
• MPEG, AVI, QuickTime
– Graphics / Virtual Environments
• CGM, VRML, OpenGL
Audio and Video
Data files often have:
– Header
• Indicates time granularity, number of channels, bits per channel
• Somewhat like a DTD
– Data
• The signal
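As one concrete case of the header and data split, the sketch below writes a tiny WAVE file in memory with Python's standard `wave` module and reads its header fields back. The parameter values (mono, 16-bit samples, 8000 frames per second, four frames of silence) and the helper names are arbitrary assumptions for illustration.

```python
import io
import wave

def make_wav(n_channels=1, sample_width=2, frame_rate=8000, n_frames=4):
    """Build a small WAVE file in memory containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(n_channels)
        writer.setsampwidth(sample_width)    # bytes per sample
        writer.setframerate(frame_rate)      # time granularity
        writer.writeframes(bytes(n_channels * sample_width * n_frames))
    return buffer.getvalue()

def read_header(wav_bytes):
    """Read the header fields that describe how to interpret the signal."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return {
            "channels": reader.getnchannels(),
            "bits_per_sample": reader.getsampwidth() * 8,
            "frames_per_second": reader.getframerate(),
            "frames": reader.getnframes(),
        }

header = read_header(make_wav())
print(header)
```

The signal bytes mean nothing without these header values, which is the sense in which the header plays a role somewhat like a DTD.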
Data may be compressed
– Data may be in the frequency domain rather than the time domain
– Data may be encoded as a sequence of differences between consecutive time segments.
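Encoding a signal as differences between consecutive values can be sketched as follows. This is plain delta encoding on a list of integer samples, a simplified stand-in for what real audio and video codecs do; the sample values are made up.

```python
from itertools import accumulate

def delta_encode(samples):
    """Keep the first sample, then store each change from the previous one."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Rebuild the original samples by running a cumulative sum."""
    return list(accumulate(deltas))

signal = [100, 102, 103, 103, 101]
deltas = delta_encode(signal)
print(deltas)
print(delta_decode(deltas))
```

When neighboring samples are similar, the differences are small numbers clustered around zero, which a later compression stage can store in fewer bits than the raw values.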