Patrick P. Bergmans / The New Document / Analogous Spaces / May17, 2008 · 2008-05-26 · Patrick...

Patrick P. Bergmans / The New Document / Analogous Spaces / May17, 2008

The New Document

Digital

Polymorphic

Ubiquitous

Actionable

Patrick P. Bergmans

University of Ghent

The Traditional Document

• Documents have been around for thousands of years

– The Bible is a document

– The Dead Sea Scrolls are documents

– The hieroglyphs of Ancient Egypt are documents

• Documents have been, and continue to be, the carrier of a large fraction of human knowledge

• Documents are stored on a specific medium

– For centuries, the traditional medium for documents has been paper

– Recently, the storage medium for documents has become digital

The Traditional Document

• Paper documents were a fairly simple concept

• Digital documents are much more complex, because of their numerous additional attributes

• The Digital Document is polymorphic; it has many, many different embodiments and representations

• Computer Scientists have introduced formal Document Models

• These models are used to

– Analyze document transformations and evolutions

– Identify resources needed for those transformations

– Define Document Processes that govern these transformations

Dimensions of Document Space

• In these models, documents are contained in a multidimensional document space (content, structure, format, time, spatial, others …), with their specific properties identified along the axes of the space

• Document transformations are trajectories in document space, describing the life of a document and its evolutions

• The multidimensional document space can be simplified by projecting it onto sub-spaces

• The initial (content, structure, format) model considers the sub-space of documents independently of time and space
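The idea of a document as a point in document space, and of a transformation as a trajectory, can be sketched in a few lines of Python. This is only an illustration: the class name DocumentState, the field values and the helper project are assumptions made for the example, not part of the model as presented.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DocumentState:
    """One point in a (hypothetical) five-axis document space."""
    content: str
    structure: str
    format: str
    time: int        # revision counter standing in for the time axis
    location: str    # stand-in for the spatial axis

# The three axes kept by the content-structure-format projection.
CSF_AXES = ("content", "structure", "format")

def project(state, axes=CSF_AXES):
    """Project a point onto a sub-space by keeping only the given axes."""
    full = asdict(state)
    return tuple(full[axis] for axis in axes)

# A trajectory: the life of one document as a sequence of points.
trajectory = [
    DocumentState("draft text", "flat", "none", 0, "author-pc"),
    DocumentState("draft text", "chapters", "none", 1, "author-pc"),
    DocumentState("draft text", "chapters", "A4 two-column", 2, "print-server"),
    DocumentState("draft text", "chapters", "A4 two-column", 3, "archive"),
]

# The same trajectory seen in the (content, structure, format) sub-space.
csf_path = [project(state) for state in trajectory]
```

In the projected path the last two points coincide: moving the document to the archive changes only time and location, which the sub-space ignores.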

Expected Model Benefits

• Precise definitions (giving common terminology)

• Definition of (generic) operators for document transformation

– Copy, Move, Erase, Print, ...

• Explicitly show where conceptual difficulties lie, giving some idea of their fundamental nature

• Enable reasoning on document transitions (e.g. versioning and property inheritance, document rights)

Content-Structure-Format Model

(three-dimensional projection)

Content-Structure-Format model

• This sub-space of the full document model can be used to illustrate how knowledge, meaning and content are derived and transformed during the structuring, formatting and physical output of a document

• The vertical axis is a kind of overall “evolutionary” axis, but not exactly a time axis

• Local Transformations are

– Content transformations at any time

– Structure transformations in the structured document plane

– Format transformations in the styled document plane

Figure: Layer & Property Examples (Structure/Format planes)

– Layers: Meaning, Knowledge, Basic Content, Styled Content, Structured Content, Output Representation, Raw Digital Image, Physical Representation

– Transition Properties: Intent, Form, Logical Structure, Presentation Format, Resources, Media Properties, Device Properties

– Examples: Logical premises; Text, Artwork; Language, Pictorial, Musical, Gesture; DTD; DOC, WPF, RTF, (HTML); SGML, XML, (HTML); Style sheet, XSL; PDF, PS, PCL, MIDI; Fonts; TIFF, GIF, BMP, WAV; Page size, Screen Resolution; Paper, Sound, Video, Voice; Screens, CD, Audio cassette, VHS, Minidisk, DVD

Digital Documents

Digital Documents

• There are many forms of Digital Documents

• It is extremely important to distinguish them

– According to the expected usage

– In terms of storage, “editability” etc.

• Issue: coexistence at several levels of representation

• Logical and physical

– Logical concepts: chapter, paragraph, sentence, word space

– Physical concepts: page, column, line of text

The Four Types of Digital Documents

• The Paper Document

• The Digital Document

– Bitmap

– PDL

– Styled

– Structured

Digital - Bitmap

• Document stored as an array of pixels

– Is really a digital “picture” of the document

• Simple 1-to-1 representation of the physical Document

• Examples: TIFF, GIF, BMP, PNG, JPG

• Large storage volume

• Little processing for imaging

• Essentially not “editable” (except with image processing tools); no text reflow
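To give a feel for the "large storage volume" of the bitmap form, the short calculation below estimates the uncompressed size of one scanned page. The page size (A4), the resolution (300 dpi) and the two bit depths are assumptions chosen for illustration; real formats such as TIFF or PNG usually compress the data.

```python
import math

def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return math.ceil(width_px * height_px * bits_per_pixel / 8)

A4 = (8.27, 11.69)  # page size in inches (assumed for the example)

bilevel = bitmap_bytes(*A4, dpi=300, bits_per_pixel=1)    # black and white
colour = bitmap_bytes(*A4, dpi=300, bits_per_pixel=24)    # 24-bit colour

print(f"1-bit: {bilevel / 1e6:.1f} MB, 24-bit: {colour / 1e6:.1f} MB")
```

Under these assumptions a single page needs roughly one megabyte in black and white and over twenty megabytes in colour, whereas the same page stored as plain text would take a few kilobytes.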

Digital - Page Description

• Contains “objects”, such as characters (glyphs), graphics, images, and a description of where (and sometimes how) they appear on the page

• Examples: PostScript, PDF, PCL; but

– PostScript is a programming language

– PDF is a non-procedural data representation system

• Reasonably compact storage

• Processing required for imaging (“RIP”)

• Device independent

• Marginally editable (moving objects), but no text reflow
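The two points "processing required for imaging" and "device independent" can be illustrated with a toy rasterizer. This is a minimal sketch, not a real RIP: the page description is assumed to be a list of filled rectangles given in inches, and the function name rip is hypothetical. Real PDLs such as PostScript or PDF describe far richer objects.

```python
def rip(page_objects, page_w_in, page_h_in, dpi):
    """Rasterize rectangle objects (positions in inches) into a 0/1 pixel grid."""
    width, height = round(page_w_in * dpi), round(page_h_in * dpi)
    bitmap = [[0] * width for _ in range(height)]
    for x, y, box_w, box_h in page_objects:
        # Convert device-independent inches into device pixels.
        x0, y0 = round(x * dpi), round(y * dpi)
        x1, y1 = round((x + box_w) * dpi), round((y + box_h) * dpi)
        for row in range(max(y0, 0), min(y1, height)):
            for col in range(max(x0, 0), min(x1, width)):
                bitmap[row][col] = 1
    return bitmap

# A toy page description: two filled boxes as (x, y, width, height) in inches.
page = [(0.5, 0.5, 1.0, 0.25), (0.5, 1.0, 2.0, 0.25)]

low = rip(page, 4, 2, dpi=8)     # a coarse device
high = rip(page, 4, 2, dpi=16)   # a finer device, same page description
```

The page description stays the same for both devices; only the rasterization step knows about the resolution, and the bitmap it produces grows with the square of the resolution.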

Digital - Styled Document

• Document contains styled and sequenced graphic elements, and a limited amount of structure

• Examples: RTF, DOC (MS Word Document), WPF

• Reasonably compact storage

• Requires processing for output (driver)

• Completely editable, and text may be reflowed

• But editing is not structure-driven

Digital - Structured Document

• Document is highly structured, and structure-controlled

• Examples: SGML, XML

– HTML is hybrid (many properties of ML, some of RTF)

• Powerful concept of “document type definition” (DTD)

• High structure-controlled editability

– Text is contained in unprocessed elements; text reflow is possible, because of its very representation

• Often requires complex editing tools

• Often used in technical documents
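Why a structured document reflows so easily can be shown with a small XML fragment and Python's standard library. The element names chapter, title and para are made up for this sketch and do not come from any particular DTD; the render function is likewise only an illustration.

```python
import textwrap
import xml.etree.ElementTree as ET

SOURCE = """\
<chapter>
  <title>Document Models</title>
  <para>Structured documents keep their text in logical elements, so the
  same paragraph can be laid out again for any line width.</para>
</chapter>
"""

def render(xml_text, width):
    """Lay out a <chapter> as plain text lines at the given line width."""
    root = ET.fromstring(xml_text)
    lines = [root.findtext("title").upper(), ""]
    for para in root.iter("para"):
        # Line breaks in the source carry no meaning; only the element does.
        text = " ".join(para.text.split())
        lines.extend(textwrap.wrap(text, width=width))
    return lines

narrow = render(SOURCE, width=30)   # e.g. a small display
wide = render(SOURCE, width=60)     # e.g. a printed page
```

Both layouts come from the same stored text. The line breaks are decided at output time, which is not possible once the document has been reduced to a page description or a bitmap.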

Figure: Transform Examples

– Layers: Meaning, Knowledge, Basic Content, Styled Content, Structured Content, Output Representation, Raw Digital Image, Physical Representation

– Transition Properties and transforms: Intent (Select), Form (Express), Logical Structure (Organize), Presentation Format (Style), Resources (Compose), Media Properties (Render), Device Properties (Playback)

– Examples: Logical premises; Text, Artwork; Language, Pictorial, Musical, Gesture; DTD; DOC, WPF, RTF; SGML, XML, (HTML); Style sheet, XSL; PDF, PS, PCL, MIDI; Fonts; TIFF, GIF, BMP, WAV; Page size, Screen Resolution; Paper, Sound, Video; Screens, CD, Audio cassette, VHS, DVD

– Tools: FrameMaker; Microsoft Word, QuarkXPress; Adobe Illustrator; Adobe Photoshop; XML parser; PostScript Driver; RIP, Speech & Sound Synthesizer; Marking engine, CRT, LCD, AV System
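The downward transforms named in the Transform Examples figure can be written as an ordered chain. The sketch below is one reading of the figure labels: the exact layer order (in particular Structured Content above Styled Content) and the pairing of each transform with a transition property are assumptions, and path_down is a hypothetical helper.

```python
# Layers from most abstract to most physical (assumed order).
LAYERS = [
    "Meaning", "Knowledge", "Basic Content", "Structured Content",
    "Styled Content", "Output Representation", "Raw Digital Image",
    "Physical Representation",
]

# One (transform, governing property) pair per step between adjacent layers.
DOWNWARD = [
    ("Select", "Intent"),
    ("Express", "Form"),
    ("Organize", "Logical Structure"),
    ("Style", "Presentation Format"),
    ("Compose", "Resources"),
    ("Render", "Media Properties"),
    ("Playback", "Device Properties"),
]

def path_down(start, end):
    """List the transforms needed to go from layer `start` down to layer `end`."""
    i, j = LAYERS.index(start), LAYERS.index(end)
    if i > j:
        raise ValueError("end must not be above start")
    return [name for name, _ in DOWNWARD[i:j]]

print(path_down("Structured Content", "Physical Representation"))
```

Under these assumptions, printing an XML document passes through Style, Compose, Render and Playback, which matches the tool chain suggested by the figure (style sheet, driver, RIP, marking engine).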

Starting from Paper

• What if the original Document is paper?

• Scan to Digital Document

– What “level” do we scan to?

• Digital-to-paper is “many-to-one”

– Green button operation

• Paper-to-digital is “one-to-many”

• Level depends on purpose

– For storage, bitmap level might be sufficient

– For editing, at least the styled content level is needed
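The rule "level depends on purpose" can be sketched as a lookup over the four types of digital document. The property flags below summarize the earlier slides in simplified form (PDL is treated as not editable, although the slides call it marginally editable), and the purpose names, including "restructure", are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DocType:
    name: str
    editable: bool     # can the content be edited with ordinary tools?
    reflow: bool       # can the text be reflowed?
    structured: bool   # is editing structure-controlled?

# Ordered from the lowest to the highest level of representation.
TYPES = [
    DocType("Bitmap", editable=False, reflow=False, structured=False),
    DocType("PDL", editable=False, reflow=False, structured=False),
    DocType("Styled", editable=True, reflow=True, structured=False),
    DocType("Structured", editable=True, reflow=True, structured=True),
]

# What each (hypothetical) purpose requires of the scanned result.
NEEDS = {
    "storage": lambda t: True,
    "edit": lambda t: t.editable and t.reflow,
    "restructure": lambda t: t.structured,
}

def scan_target(purpose):
    """Return the lowest level that satisfies the given purpose."""
    return next(t.name for t in TYPES if NEEDS[purpose](t))
```

For example, scan_target("storage") yields the bitmap level and scan_target("edit") the styled level, in line with the two bullets above. Choosing the lowest sufficient level matters because each step upward from paper requires more recognition work.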

Figure: Upward Transforms

– Layers: Meaning, Knowledge, Basic Content, Styled Content, Structured Content, Output Representation, Raw Digital Image, Physical Representation

– Upward transforms: Capture (Device Properties), Segment (Media Properties), Recognize (Resources), Re-Structure (Presentation Format), Fragment (Logical Structure), Learn, Understand

Figure: layer diagram annotated with examples of transformations within a layer

– Labels: WPF, RTF, DOC; TIFF, BMP, GIF; English, German, French (Translation); Contents Edits, Conversions; Processing, Format Conversion; Product Specification, Customer Documentation (Re-Targeting); SGML, XML, HTML (Structure Edits, Conversions)

Examples of Applications of the Model

Figure: Analog Copier, shown on two copies of the layer diagram

Figure: Digital Copier, shown on two copies of the layer diagram; labels: Bitmap, Image Processing, Bitmap

Figure: Multi-Function Devices, shown on two copies of the layer diagram; labels: Bitmap, Image Processing, Bitmap; Ripping, PDL; OCR; PDL; TextBridge; DOC

Figure: Translating Copier, shown on the layer diagram with the upward transforms (Capture, Segment, Recognize, Re-Structure, Fragment, Learn, Understand) and the downward transforms (Select, Express, Organize, Style, Compose, Render, Playback)

Figure: The function of the RIP, shown on the layer diagram with the Output Representation layer labelled PDL (PS, PDF, etc); other labels: The Paper World, The Digital World, One-to-one, RIP

Figure: Steps to improve “OCR”, shown next to the layer diagram with the upward transforms (Capture, Segment, Recognize, Re-Structure, Fragment, Learn, Understand)

– Labels: Meaning; Text content; Form; Recognize Language; Trigram model; Morpho-analysis; Morpho-syntactics; Tagged Content; Basic Content; Recognise Basic Form; “English” Text; Syntactic Parsing; Content Dependencies; Syntactic Structure; Language; Language syntax; Content type; External Information; Pragmatic knowledge; Parts of speech; Semantic analysis; Language Semantics; Semantic Content; Semantic Structure; Pragmatic analysis

The Networked Document (1)

• The Paper Document

• The Digital Document

– Bitmap

– PDL

– Styled

– Structured

• The Networked Document

– Distributed and Hyperlinked Documents

– Documents with Network Intelligence

– Documents with Workflow Intelligence

– Mobile Documents

The Networked Document (2)

• Parts of Documents are stored in different locations on the network

– For example, images on an image server

– Or a large number of logically linked servers

• Documents are dynamically “assembled”

– When viewing

– When printing

• Requires networks with

– High performance

– High availability

• The technology for dynamic document assembly is hyperlinking
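Dynamic assembly means that the stored document holds links rather than copies, and that the links are resolved only when the document is viewed or printed. The sketch below imitates this with two dictionaries standing in for servers; the store names, link names and functions are all hypothetical, and a real system would make network requests instead of dictionary lookups.

```python
# Hypothetical stores standing in for servers on the network.
IMAGE_SERVER = {"img/logo": "[logo v1]"}
TEXT_SERVER = {"txt/intro": "Welcome."}

def resolve(link):
    """Fetch the current object behind a link (a lookup stands in for a network call)."""
    server = IMAGE_SERVER if link.startswith("img/") else TEXT_SERVER
    return server[link]

# The stored document holds links, not copies of the objects.
document = ["txt/intro", "img/logo"]

def assemble(doc):
    """Resolve every link at viewing or printing time."""
    return [resolve(link) for link in doc]

first_view = assemble(document)
IMAGE_SERVER["img/logo"] = "[logo v2]"   # the linked object changes on its server
second_view = assemble(document)
```

The second view picks up the new logo without the document itself having been edited, which is also why the network must be fast and available whenever the document is rendered.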

The Networked Document (3)

• “Hypertext” was a major fundamental advance in Document Storage Architecture

– Documents are “linked” to integrate external objects

– Powerfully implemented in HTML and XML

• HTML is “vulnerable”, unfit for a robust corporate Document Management System

• XML is much better for linking purposes (through XLL)

• Network-based storage of corporate documents requires a Document Storage Architecture with robust links and strong link management

– Distinguishing between Intranet and Internet

The Networked Document (4)

• Bi-directional linking, link registry, link ownership and link lifetime management are key

Figure: linked objects A, B1, B2, C1, C2, C3, X1, X2, X3 and Z1, with a region labelled “Intranet with full object control”

• C2 “knows” it is “used” by B1 and C3

• X2, Z1 don't “know” who uses them
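A link registry with reverse links can be sketched as follows. The class and method names are hypothetical, and so are the links from B2 to X2 and from C1 to Z1, which are added only to have something pointing outside the managed set. Only the facts that C2 is used by B1 and C3, and that X2 and Z1 do not know their users, come from the slide.

```python
from collections import defaultdict

class LinkRegistry:
    """Records links in both directions for objects under intranet control."""

    def __init__(self, managed):
        self.managed = set(managed)        # objects the registry controls
        self.uses = defaultdict(set)       # source -> targets (forward links)
        self.used_by = defaultdict(set)    # target -> sources (reverse links)

    def link(self, source, target):
        self.uses[source].add(target)
        if target in self.managed:         # reverse link only where we have control
            self.used_by[target].add(source)

    def who_uses(self, target):
        """Return the known users of `target`, or None if it is outside our control."""
        if target not in self.managed:
            return None
        return sorted(self.used_by[target])

    def can_delete(self, target):
        """A managed object is safe to delete only when nothing links to it."""
        return target in self.managed and not self.used_by[target]

registry = LinkRegistry(managed={"A", "B1", "B2", "C1", "C2", "C3"})
registry.link("B1", "C2")
registry.link("C3", "C2")
registry.link("B2", "X2")   # assumed link to an external object
registry.link("C1", "Z1")   # assumed link to an external object
```

With the reverse links recorded, the registry can refuse to delete C2 while B1 and C3 still depend on it. For X2 and Z1 no such guarantee is possible, which is the difference between intranet and internet that the slides point to.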

The “Network Intelligent” Document (1)

• Documents which “adapt themselves” to the (limited) bandwidth of the network

• Documents with “hierarchical” information representation

– On a printer

– On a display

– On a Portable Document Reader

• Requires several levels of representation

– Explicitly stored internally

– Or automatic summarization

The “Network Intelligent” Document (2)

• Automatically generated at authoring time

– To be available when needed (like thumbnails)

• Reproduction adapted to available bandwidth or storage

Figure: representations Full Images, Full Text, Summary and URL, arranged along an axis from Small Bandwidth / Small Storage to Large Bandwidth / Large Storage
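Adapting the reproduction to the available bandwidth amounts to picking the richest representation that fits a budget. In the sketch below, the byte sizes, the two-second waiting limit and the function name choose are assumptions for illustration; the ordering from URL up to Full Images follows the figure labels.

```python
# Representations from lightest to richest, with illustrative sizes in bytes.
REPRESENTATIONS = [
    ("URL", 100),
    ("Summary", 2_000),
    ("Full Text", 50_000),
    ("Full Images", 5_000_000),
]

def choose(bandwidth_bytes_per_s, max_wait_s=2.0):
    """Pick the richest representation deliverable within max_wait_s seconds."""
    budget = bandwidth_bytes_per_s * max_wait_s
    best = REPRESENTATIONS[0][0]   # the URL is assumed to be always deliverable
    for name, size in REPRESENTATIONS:
        if size <= budget:
            best = name
    return best
```

For example, choose(10_000) selects the summary, while choose(10_000_000) selects the version with full images. The same selection could be driven by available storage instead of bandwidth.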

The “New Document” is …

• Live, Dynamic, Updatable

– “Freezes” when printed or converted to a static (conventional digital) document, “lives on the net”

• Linked, Hyperlinked

– With links resolved at rendering time (printing or viewing)

– Implementing the ultimate “late binding” capability

– Inherently supporting variable/personalized publishing

– With “reverse link” control for document integrity

• Intelligent, Adaptable

– Integrates some of its own “workflow” procedures

– Understands the limitations of communication channels, and of viewing or printing equipment, and adapts itself

• Auto-translating, Auto-summarizing

The “New Document” is …

Generator of a whole new range of activities, such as

• Document-based collaboration activities

– Collaborative authoring

• Document-based business and administrative processes

– Supports complex product design and approval cycles

– Integrates document rights

– Integrates digital signatures and biometric data

• Document-based search methods and engines

– Search engines for the WWW

– Search engines for DMS

– Meta-search engines and restricted-domain search engines

The “New Document” is …

• Digital, of course, and …

• Polymorphic: exists in many different variations, media, formats, etc.

• Ubiquitous: linked and hyperlinked, distributed, dynamic and mobile

• Actionable: supporting business processes and generating activities unthinkable a decade ago

Thank you
