Patrick P. Bergmans / The New Document / Analogous Spaces / May17, 2008
The New Document
Digital
Polymorphic
Ubiquitous
Actionable
Patrick P. Bergmans
University of Ghent
The Traditional Document
• Documents have been around for thousands of years
– The Bible is a document
– The Dead Sea Scrolls are documents
– The hieroglyphs of Ancient Egypt are documents
• Documents have been and continue to be the support of a
large fraction of human knowledge
• Documents are stored on a specific medium
– For centuries, the traditional medium for documents has
been paper
– Recently, the storage medium for documents has become
digital
The Traditional Document
• Paper documents were a fairly simple concept
• Digital documents are much more complex, because of their
numerous additional attributes
• The Digital Document is polymorphic; it has many, many
different embodiments and representations
• Computer Scientists have introduced formal Document Models
• These models are used to
– Analyze document transformations and evolutions
– Identify resources needed for those transformations
– Define Document Processes that govern these
transformations
Dimensions of Document Space
• In these models, documents are contained in a multi-dimensional
document space (content, structure, format, time, space,
others … ), and their specific properties are identified along
the axes of the space
• Document transformations are trajectories in document
space, describing the life of a document and its evolutions
• The multidimensional document space can be simplified by
projecting it onto sub-spaces
• The initial (content, structure, format) model considers the sub-
space of documents independently of time and space
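The idea of a document as a point in a multi-dimensional space, and of a projection onto the (content, structure, format) sub-space, can be sketched as follows. This is only an illustration: the class name `DocumentPoint`, the field values and the `project` helper are hypothetical and not part of the model as presented.

```python
from dataclasses import dataclass, asdict

# Hypothetical illustration: one point in document space.
@dataclass(frozen=True)
class DocumentPoint:
    content: str
    structure: str
    format: str
    time: str       # e.g. a version date
    location: str   # the "space" axis

def project(point, axes=("content", "structure", "format")):
    """Project a point onto a sub-space by keeping only the named axes."""
    full = asdict(point)
    return {axis: full[axis] for axis in axes}

# A trajectory: successive points describing the life of one document.
trajectory = [
    DocumentPoint("draft text", "XML", "none", "2008-05-01", "author laptop"),
    DocumentPoint("draft text", "XML", "PDF", "2008-05-02", "print server"),
]

projected = [project(p) for p in trajectory]
```

In the projected trajectory only the format changes between the two points; the time and space axes have been dropped.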
Expected Model Benefits
• Precise definitions (giving common terminology)
• Definition of (generic) operators for document transformation
– Copy, Move, Erase, Print, ...
• Explicitly show where conceptual difficulties lie, giving some
ideas of their fundamental nature
• Enable reasoning on document transitions (e.g. versioning,
property inheritance, document rights)
Content-Structure-Format Model
(three-dimensional projection)
Content-Structure-Format model
• This sub-space of the full document model can be used to
illustrate how knowledge, meaning and content are derived
and transformed during the structuring, formatting and physical
output of a document
• The vertical axis is a kind of overall “evolutionary” axis, but not
exactly a time axis
• Local Transformations are
– Content transformations at any time
– Structure transformations in the structured document plane
– Format transformations in the styled document plane
[Figure: layer diagram of the Content-Structure-Format model, with the headings “Layer & Property Examples”, “Structure/Format planes” and “Transition Properties”]
• Layer and property labels, top to bottom: Meaning, Intent, Knowledge, Basic Content, Form, Logical Structure, Styled Content, Structured Content, Presentation Format, Output Representation, Resources, Raw Digital Image, Media Properties, Physical Representation, Device Properties
• Examples, top to bottom: Logical premises; Text, Artwork; Language, Pictorial, Musical, Gesture; DTD; DOC, WPF, RTF, (HTML); SGML, XML, (HTML); Style sheet, XSL; PDF, PS, PCL, MIDI; Fonts; TIFF, GIF, BMP, WAV; Page size, Screen Resolution; Paper, Sound, Video, Voice; Screens, CD, Audio cassette, VHS, Minidisk, DVD
Digital Documents
• There are many forms of Digital Documents
• It is extremely important to distinguish them
– According to expected usage
– In terms of storage, “editability” etc.
• Issue: coexistence at several levels of representation
• Logical and physical
– Logical concepts: chapter, paragraph, sentence, word
space
– Physical concepts: page, column, line of text
The Four Types of Digital Documents
• The Paper Document
• The Digital Document
– Bitmap
– PDL (page description)
– Styled
– Structured
Digital - Bitmap
• Document stored as an array of pixels
– Is really a digital “picture” of the document
• Simple 1-to-1 representation of the physical Document
• Examples: TIFF, GIF, BMP, PNG, JPG
• Large storage volume
• Little processing for imaging
• Essentially not “editable” (except with image processing tools);
no text reflow
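To give a feel for the “large storage volume” of bitmaps, the uncompressed size of a scanned page can be estimated from page size, resolution and bit depth. The A4 dimensions and the 300 dpi resolution below are assumptions chosen for illustration, not figures from the presentation.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page stored as a pixel array."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

# Assumed page size: A4 expressed in inches.
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69

bilevel = bitmap_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, 300, 1)   # black and white
colour = bitmap_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, 300, 24)   # 24-bit colour
print(f"1-bit: {bilevel / 1e6:.1f} MB, 24-bit: {colour / 1e6:.1f} MB")
```

With these assumed values, a bilevel A4 page comes to roughly 1 MB uncompressed and a 24-bit colour page to roughly 26 MB.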
Digital - Page Description
• Contains “objects”, such as characters (glyphs), graphics,
images, and a description of where (and sometimes how) they
appear on the page
• Examples: PostScript, PDF, PCL; but
– PostScript is a programming language
– PDF is a non-procedural data representation system
• Reasonably compact storage
• Processing required for imaging (“RIP”)
• Device independent
• Marginally editable (moving objects), but no text reflow
Digital - Styled Document
• Document contains styled and sequenced graphic elements,
and a limited amount of structure
• Examples: RTF, DOC (MS Word document), WPF
• Reasonably compact storage
• Requires processing for output (driver)
• Completely editable, and text may be reflowed
• But editing is not structure-driven
Digital - Structured Document
• Document is highly structured, and structure-controlled
• Examples: SGML, XML
– HTML is hybrid (many properties of ML, some of RTF)
• Powerful concept of “document type definition” (DTD)
• High structure-controlled editability
– Text is contained in unprocessed elements; text reflow is
possible, because of its very representation
• Often requires complex editing tools
• Often used in technical documents
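The properties listed for the four types of digital document can be collected in one small table. The sketch below only restates the preceding slides; the `DocType` record and the `types_supporting_reflow` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

class DocType(NamedTuple):
    name: str
    examples: tuple
    editable: str      # wording taken from the slides
    text_reflow: bool

DOC_TYPES = [
    DocType("Bitmap", ("TIFF", "GIF", "BMP", "PNG", "JPG"), "essentially not", False),
    DocType("Page description", ("PostScript", "PDF", "PCL"), "marginally", False),
    DocType("Styled", ("RTF", "DOC", "WPF"), "completely", True),
    DocType("Structured", ("SGML", "XML"), "structure-controlled", True),
]

def types_supporting_reflow(doc_types=DOC_TYPES):
    """Names of the document types whose text can be reflowed."""
    return [t.name for t in doc_types if t.text_reflow]

print(types_supporting_reflow())
```

Seen this way, text reflow is what separates the bitmap and page description types from the styled and structured ones.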
[Figure: “Transform Examples” on the layer diagram]
• Downward transforms, top to bottom, with the transition property each appears next to: Select (Intent), Express (Form), Organize (Logical Structure), Style (Presentation Format), Compose (Resources), Render (Media Properties), Playback (Device Properties)
• Format examples, top to bottom: Logical premises; Text, Artwork; Language, Pictorial, Musical, Gesture; DTD; DOC, WPF, RTF; SGML, XML, (HTML); Style sheet, XSL; PDF, PS, PCL, MIDI; Fonts; TIFF, GIF, BMP, WAV; Page size, Screen Resolution; Paper, Sound, Video; Screens, CD, Audio cassette, VHS, DVD
• Tool examples, top to bottom: Framemaker; Microsoft Word, Quark Xpress; Adobe Illustrator; Adobe Photoshop; XML parser; Postscript Driver; RIP, Speech & Sound Synthesizer; Marking engine, CRT, LCD, AV System
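Read as a pipeline, the downward transforms take a document from meaning to physical representation one layer at a time. The sketch below encodes that chain; the pairing of each transform with a source and target layer is an interpretation of the diagram labels, and the names `DOWNWARD` and `path_down` are hypothetical.

```python
# Assumed reading of the diagram: (transform, from_layer, to_layer).
DOWNWARD = [
    ("Select", "Meaning", "Knowledge"),
    ("Express", "Knowledge", "Basic Content"),
    ("Organize", "Basic Content", "Structured Content"),
    ("Style", "Structured Content", "Styled Content"),
    ("Compose", "Styled Content", "Output Representation"),
    ("Render", "Output Representation", "Raw Digital Image"),
    ("Playback", "Raw Digital Image", "Physical Representation"),
]

def path_down(start, end):
    """List the transforms needed to go from layer `start` down to `end`."""
    layers = [DOWNWARD[0][1]] + [step[2] for step in DOWNWARD]
    i, j = layers.index(start), layers.index(end)
    if i > j:
        raise ValueError("end layer must not be above start layer")
    return [step[0] for step in DOWNWARD[i:j]]

print(path_down("Structured Content", "Physical Representation"))
```

Under this reading, printing an XML document involves the Style, Compose, Render and Playback transforms in that order.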
Starting from Paper
• What if the original Document is paper?
• Scan to Digital Document
– What “level” do we scan to?
• Digital-to-paper is “many-to-one”
– Green button operation
• Paper-to-digital is “one-to-many”
• Level depends on purpose
– For storage, bitmap level might be sufficient
– For edits, at least styled content level
[Figure: “Upward Transforms” on the layer diagram]
• Upward transforms, bottom to top, with the transition property each appears next to: Capture (Device Properties), Segment (Media Properties), Recognize (Resources), Re-Structure (Presentation Format), Fragment (Logical Structure), Learn, Understand
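The rule that the scan level depends on the purpose can be combined with the upward transforms: the higher the target level, the more transforms a paper original has to go through. The sketch below is one possible reading; the layer reached by each transform, the equation of “bitmap level” with the Raw Digital Image layer, and the names `UPWARD`, `TARGET` and `transforms_for` are assumptions.

```python
# Assumed reading of the diagram: (transform, layer reached),
# starting from the Physical Representation (paper).
UPWARD = [
    ("Capture", "Raw Digital Image"),
    ("Segment", "Output Representation"),
    ("Recognize", "Styled Content"),
    ("Re-Structure", "Structured Content"),
    ("Fragment", "Basic Content"),
    ("Learn", "Knowledge"),
    ("Understand", "Meaning"),
]

# Target level per purpose: the two cases named on the slide.
TARGET = {"storage": "Raw Digital Image", "edit": "Styled Content"}

def transforms_for(purpose):
    """Upward transforms to apply to a paper original for a given purpose."""
    target = TARGET[purpose]
    steps = []
    for transform, layer in UPWARD:
        steps.append(transform)
        if layer == target:
            return steps
    raise ValueError(f"unknown target layer {target!r}")

print(transforms_for("storage"))
print(transforms_for("edit"))
```

Scanning for storage stops after Capture, whereas scanning for editing continues through Segment and Recognize.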
[Figure: transformation examples placed on the layer diagram]
• Translation (English, German, French)
• Contents Edits, Conversions
• Re-Targeting (Product Specification, Customer Documentation)
• Structure Edits, Conversions (SGML, XML, HTML)
• Processing, Format Conversion
• Format groups shown: WPF, RTF, DOC; TIFF, BMP, GIF
Examples of Applications of the Model
Analog Copier
[Figure: the analog copier placed on the layer diagram, from Meaning down to Physical Representation]
Digital Copier
[Figure: the digital copier placed on the layer diagram; labels: Bitmap, Image Processing, Bitmap]
Multi-Function Devices
[Figure: multi-function devices placed on the layer diagram; labels: Bitmap, Image Processing, Bitmap; Ripping, PDL; OCR, PDL; TextBridge, DOC]
Translating Copier
[Figure: the translating copier placed on the layer diagram, showing both the upward transforms (Capture, Segment, Recognize, Re-Structure, Fragment, Learn, Understand) and the downward transforms (Select, Express, Organize, Style, Compose, Render, Playback)]
The function of the RIP
[Figure: the RIP placed on the layer diagram, with PDL (PS, PDF, etc.) at the Output Representation layer; labels: The Paper World, The Digital World, One-to-one, RIP]
Steps to improve “OCR”
[Figure: the upward transforms (Capture, Segment, Recognize, Re-Structure, Fragment, Learn, Understand) on the layer diagram, next to a diagram of language-processing steps]
• Processing steps: Recognise Basic Form, Recognize Language, Morpho-analysis, Syntactic Parsing, Semantic analysis, Pragmatic analysis
• Other labels: Meaning, Text content, Form, Basic Content, “English” Text, Trigram model, Morpho-syntactics, Tagged Content, Parts of speech, Language, Language syntax, Syntactic Structure, Content Dependencies, Language Semantics, Semantic Content, Semantic Structure, Content type, External Information, Pragmatic knowledge
The Networked Document (1)
• The Paper Document
• The Digital Document
– Bitmap
– PDL
– Styled
– Structured
• The Networked Document
– Distributed and Hyperlinked Documents
– Documents with Network Intelligence
– Documents with Workflow Intelligence
– Mobile Documents
The Networked Document (2)
• Parts of Documents are stored in different locations on the
network
– For example, images on an image server
– Or a large number of logically linked servers
• Documents are dynamically “assembled”
– When viewing
– When printing
• Requires networks with
– High performance
– High availability
• The technology for dynamic document assembly is hyperlinking
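Dynamic assembly can be pictured as resolving links only at the moment a document is viewed or printed. The sketch below uses an in-memory dictionary in place of real servers; the `{{link:...}}` placeholder syntax, the store contents and the `assemble` function are all hypothetical.

```python
import re

# Stand-in for parts stored at different network locations (hypothetical keys).
STORE = {
    "imgsrv/logo": "[image: logo]",
    "docsrv/intro": "Documents have been around for thousands of years.",
}

LINK = re.compile(r"\{\{link:([^}]+)\}\}")

def assemble(template, store):
    """Resolve every link against the store at the time of the call."""
    def resolve(match):
        key = match.group(1)
        if key not in store:
            raise KeyError(f"broken link: {key}")
        return store[key]
    return LINK.sub(resolve, template)

template = "{{link:imgsrv/logo}} {{link:docsrv/intro}}"
first_view = assemble(template, STORE)

# Updating a part changes the next assembly, without touching the template.
STORE["docsrv/intro"] = "Documents are polymorphic."
second_view = assemble(template, STORE)
```

Because links are resolved on every call, the second view picks up the updated part automatically, and a missing part surfaces as a broken link at assembly time.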
The Networked Document (3)
• “Hypertext” was a major fundamental advance in Document
Storage Architecture
– Documents are “linked” to integrate external objects
– Powerfully implemented in HTML and XML
• HTML is “vulnerable”, unfit for a robust corporate Document
Management System
• XML is much better for linking purposes (through XLL)
• Network-based storage of corporate documents requires a
Document Storage Architecture with robust links and strong
link management
– Distinguishing between intranet and Internet
The Networked Document (4)
• Bi-directional linking, link registry, link ownership and link
lifetime management are key
[Figure: link graph with objects A, B1, B2, C1, C2, C3, X1, X2, X3 and Z1; label: Intranet with full object control]
• C2 “knows” it is “used” by B1 and C3
• X2, Z1 don’t “know” who uses them
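A link registry that records both directions of every link is one way to let an object “know” who uses it. The following is a minimal sketch under that assumption; the class `LinkRegistry` and its methods are hypothetical, and only the object names come from the slide.

```python
from collections import defaultdict

class LinkRegistry:
    """Keeps forward and reverse links for objects under full control."""

    def __init__(self):
        self._uses = defaultdict(set)     # source -> targets
        self._used_by = defaultdict(set)  # target -> sources

    def link(self, source, target):
        """Register a link in both directions."""
        self._uses[source].add(target)
        self._used_by[target].add(source)

    def used_by(self, target):
        """Sorted list of objects that link to `target`."""
        return sorted(self._used_by.get(target, ()))

    def can_delete(self, target):
        """An object may be removed only if nothing links to it."""
        return not self._used_by.get(target)

intranet = LinkRegistry()
intranet.link("B1", "C2")
intranet.link("C3", "C2")

print(intranet.used_by("C2"))
```

With reverse links available, deleting C2 can be refused while B1 and C3 still depend on it, which is the kind of integrity control that objects outside the registry cannot offer.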
The “Network Intelligent” Document (1)
• Documents which “adapt themselves” to the (limited)
bandwidth of the network
• Documents with “hierarchical” information representation
– On a printer
– On a display
– On a Portable Document Reader
• Requires several levels of representation
– Explicitly stored internally
– Or automatic summarization
The “Network Intelligent” Document (2)
• Automatically generated at authoring time
– To be available when needed (like thumbnails)
• Reproduction adapted to available bandwidth or storage
[Figure: levels of representation, ranging from small bandwidth / small storage to large bandwidth / large storage: URL, Summary, Full Text, Full Images]
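Adapting reproduction to the available bandwidth amounts to picking the richest stored representation that fits a budget. In the sketch below the four levels come from the slide, while the sizes in bytes, the time budget and the `pick_representation` function are invented for illustration.

```python
# (level, size in bytes) ordered from lightest to richest; sizes are made up.
REPRESENTATIONS = [
    ("URL", 100),
    ("Summary", 2_000),
    ("Full Text", 50_000),
    ("Full Images", 5_000_000),
]

def pick_representation(bandwidth_bytes_per_s, max_seconds=2.0):
    """Richest level that can be transferred within the time budget."""
    budget = bandwidth_bytes_per_s * max_seconds
    chosen = REPRESENTATIONS[0][0]  # assumption: the URL is always the fallback
    for level, size in REPRESENTATIONS:
        if size <= budget:
            chosen = level
    return chosen

print(pick_representation(1_000))       # slow link
print(pick_representation(10_000_000))  # fast link
```

Since the levels are generated at authoring time, the choice at reproduction time reduces to this kind of lookup rather than on-the-fly summarization.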
The “New Document” is …
• Live, Dynamic, Updatable
– “Freezes” when printed or converted to a static
(conventional digital) document, “lives on the net”
• Linked, Hyperlinked
– With links resolved at rendering time (printing or viewing)
– Implementing the ultimate “late binding” capability
– Inherently supporting variable/personalized publishing
– With “reverse link” control for document integrity
• Intelligent, Adaptable
– Integrates some of its own “workflow” procedures
– Understands the limitation of communication channels,
and of viewing or printing equipment, and adapts itself
• Auto-translating, Auto-summarizing
The “New Document” is …
Generator of a whole new range of activities, such as
• Document-based collaboration activities
– Collaborative authoring
• Document-based business and administrative processes
– Supports complex product design and approval cycles
– Integrates document rights
– Integrates digital signatures and biometric data
• Document-based search methods and engines
– Search engines for the WWW
– Search engines for DMS
– Meta-search engines and restricted-domain search engines
The “New Document” is …
• Digital, of course, and …
• Polymorphic: exists in many different variations, media, formats,
etc.
• Ubiquitous: linked and hyperlinked, distributed, dynamic and
mobile
• Actionable: supporting business processes and generating
activities unthinkable a decade ago
Thank you