Rheinisch-Westfälische Technische Hochschule Aachen, Department of Computer Science III, Prof. Dr.-Ing. M. Nagl
September 14, 2000 / Digital Documents and Electronic Publishing 2000
Inferring Structure Information from Typography
Christian Fuß
Dipl.-Inform. Felix Gatzemeier
Michael Kirchhof
Dipl.-Inform. Oliver Meyer
Department of Computer Science III, RWTH Aachen
Department of Computer Science III
RWTH Aachen
DDEP 2000
Inferring Structure Information from Typography
Overview
Context
Deriving Structure Information:
» Partitioning
» Typographic abstraction
» Determine Type
Conclusion
Cooperation project of
Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)
Today’s Publication Chain
[Figure: publication chain diagram. Labels: Author, Writing, Proprietary document format; Publisher, Copy Editing, Typesetting, Conversion, Web Publ., Standard format; Reader, Reading]
Classification of Submissions
[Figure: classification of submissions. Labels: Submissions, TeX, MS Word, Unformatted, Somehow Formatted, Formatted, Correctly Formatted, Structured (XML)]
Basic Assumptions
Textual Nature
Typographic markup
Consistent markup
Known target document type
Deriving Structure Information
In: MS Word document
Record Formatting (Format Tuples)
Locate the Elements
Reduce Format Tuples to Patterns
Determine Types
Out: XML document
Also interactively
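The output end of this chain can be pictured with a small sketch. It assumes the earlier steps have already produced, for each paragraph, a list of (element type, text) segments; the element names "document" and "para" and the function to_xml are hypothetical and not taken from the slides or from any real DTD.

```python
import xml.etree.ElementTree as ET

def to_xml(paragraphs):
    """Serialize typed segments as XML.

    paragraphs: list of paragraphs, each a list of (element_type, text)
    pairs; element_type None means plain paragraph text.
    """
    root = ET.Element("document")              # hypothetical root element
    for segments in paragraphs:
        para = ET.SubElement(root, "para")     # hypothetical paragraph element
        last = None                            # last inline child written so far
        for element_type, text in segments:
            if element_type is None:
                # Plain text goes before the first child or after the last one.
                if last is None:
                    para.text = (para.text or "") + text
                else:
                    last.tail = (last.tail or "") + text
            else:
                last = ET.SubElement(para, element_type)
                last.text = text
    return ET.tostring(root, encoding="unicode")

xml_out = to_xml([[(None, "Is this a "), ("FirstTerm", "dagger"), (None, "?")]])
print(xml_out)
```

Which word carries the inline element is invented here for illustration; the formatting of the slide's example sentence did not survive extraction.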
Format Tuples
The basic typographic abstraction
FormatTuple("Is this a dagger?") = [Times, 22pt, regular, roman]
Here: Font, Size, Weight, Variation
Planned: Search expressions modulo Text
More general: including regular expressions of text content or context.
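As a minimal sketch, a format tuple with the four components named above could be modelled as an immutable record, and "recording" could mean collecting the distinct tuples of a document. The names FormatTuple, Run and record_format_tuples are assumptions for illustration, as is the choice of which word is bold.

```python
from typing import NamedTuple

class FormatTuple(NamedTuple):
    font: str
    size_pt: int
    weight: str      # e.g. "regular" or "bold"
    variation: str   # e.g. "roman" or "italic"

class Run(NamedTuple):
    """A stretch of text with uniform formatting (hypothetical input unit)."""
    text: str
    fmt: FormatTuple

def record_format_tuples(runs):
    """Return the distinct format tuples in order of first appearance."""
    seen = []
    for run in runs:
        if run.fmt not in seen:
            seen.append(run.fmt)
    return seen

regular = FormatTuple("Times", 22, "regular", "roman")
bold = FormatTuple("Times", 22, "bold", "roman")
runs = [Run("Is this a ", regular), Run("dagger", bold), Run("?", regular)]
recorded = record_format_tuples(runs)
print(recorded)
```

Because the tuples are hashable and comparable, they can later serve directly as keys of a format-to-type map.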
Locate the Elements
Tree-Partitioning of Formatted Character Streams on
» Format Tuple changes
» Paragraph breaks
Nesting of Inline Elements
» Is this a dagger? <ft1>
» Is this a dagger? <ft1 <ft2> ft1>
» Is this a dagger? <ft1 <ft2> >
» Is this a dagger? <ft1 <ft2 <ft3> > >
Format-To-Type Map:
FormatTuple                       ElementType
ft1 (times, 22pt, reg, roman)     dummyType1
ft2 (times, 22pt, bold, roman)    dummyType2
ft3 (times, 22pt, reg, italic)    dummyType3
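The partitioning on format tuple changes and paragraph breaks can be sketched as follows. This is a simplified two-level version (document, paragraphs, segments) and does not attempt the nesting of inline elements; the function name partition, the use of "\n" as paragraph break, and the bold word are assumptions for illustration.

```python
def partition(chars):
    """Split a formatted character stream into paragraphs and segments.

    chars: iterable of (character, format_tuple) pairs; "\n" marks a
    paragraph break. Returns a list of paragraphs, each a list of
    (format_tuple, text) segments.
    """
    paragraphs = []
    current = []                      # segments of the paragraph being built
    for ch, fmt in chars:
        if ch == "\n":
            # Paragraph break: close the current paragraph.
            if current:
                paragraphs.append(current)
            current = []
        elif current and current[-1][0] == fmt:
            # Same format tuple: extend the last segment.
            current[-1] = (fmt, current[-1][1] + ch)
        else:
            # Format tuple change: start a new segment.
            current.append((fmt, ch))
    if current:
        paragraphs.append(current)
    return paragraphs

FT1 = ("times", 22, "reg", "roman")
FT2 = ("times", 22, "bold", "roman")
stream = (
    [(c, FT1) for c in "Is this a "]
    + [(c, FT2) for c in "dagger"]
    + [(c, FT1) for c in "?\nYes."]
)
tree = partition(stream)
print(tree)
```

Each distinct format tuple found in the segments would then receive a dummy type in the map until real types are determined.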
Format patterns
Identity is too restrictive: generalize with wildcards
Is this a dagger? (,,)
Times     Times   Times     *
22pt      22pt    22pt      *
regular   bold    regular   bold
roman     roman   roman     *
(, a, b) = (a, a, b); (a, b, ) = (a, b, b)
(, a, ) propagated to paragraph level
Format-To-Type Map:
FormatPattern                     ElementType
fp1 (*, *, regular, *)            dummyType1
fp2 (*, *, bold, *)               dummyType2
fp2b (*, *, bold, roman)          dummyType2
fp3 (*, *, regular, italic)       dummyType3
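One plausible reading of the wildcard generalization is sketched below: a component of a segment's format tuple is kept only if it differs from at least one neighbouring segment, and becomes "*" otherwise. The symbols in the boundary rules were lost in extraction, so treating them as "missing neighbour, substitute the current tuple" is an assumption, as is the function name generalize. Propagation to paragraph level is not implemented.

```python
WILDCARD = "*"

def generalize(prev, cur, nxt):
    """Reduce the format tuple `cur` to a pattern relative to its neighbours.

    prev / nxt are the neighbouring format tuples, or None at a boundary.
    """
    # Assumed boundary rule: a missing neighbour counts as equal to `cur`.
    prev = cur if prev is None else prev
    nxt = cur if nxt is None else nxt
    return tuple(
        WILDCARD if (c == p and c == n) else c
        for p, c, n in zip(prev, cur, nxt)
    )

regular = ("Times", "22pt", "regular", "roman")
bold = ("Times", "22pt", "bold", "roman")
italic = ("Times", "22pt", "regular", "italic")

middle = generalize(regular, bold, regular)
first = generalize(None, regular, bold)
nested = generalize(bold, italic, bold)
alone = generalize(None, regular, None)
print(middle, first, nested, alone)
```

Under this reading the bold segment between two regular ones yields (*, *, bold, *), matching fp2, and a segment with no neighbours yields only wildcards, which is the case the slide hands over to the paragraph level.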
Determine Types
Replace dummy types in Format-To-Type Map
Preconfiguration by publisher
Controlled Learning from the author
FormatPattern                ElementType
(*, *, regular, *)           Body
(*, *, bold, *)              FirstTerm
(*, *, bold, roman)          FirstTerm
(*, *, regular, italic)      Emphasis
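A lookup against such a map might look like the sketch below. When several patterns match, it picks the one with the fewest wildcards; that tie-breaking rule is an assumption for illustration and is not stated on the slides. The names FORMAT_TO_TYPE, matches and element_type are likewise hypothetical.

```python
FORMAT_TO_TYPE = {
    ("*", "*", "regular", "*"): "Body",
    ("*", "*", "bold", "*"): "FirstTerm",
    ("*", "*", "bold", "roman"): "FirstTerm",
    ("*", "*", "regular", "italic"): "Emphasis",
}

def matches(pattern, fmt):
    """True if every non-wildcard component of the pattern equals fmt's."""
    return all(p == "*" or p == f for p, f in zip(pattern, fmt))

def element_type(fmt, mapping=FORMAT_TO_TYPE):
    """Return the type of the most specific matching pattern, or None."""
    candidates = [p for p in mapping if matches(p, fmt)]
    if not candidates:
        return None
    best = min(candidates, key=lambda p: p.count("*"))  # fewest wildcards
    return mapping[best]

print(element_type(("Times", "22pt", "regular", "italic")))
print(element_type(("Times", "22pt", "bold", "roman")))
print(element_type(("Times", "22pt", "regular", "roman")))
```

A format that matches no pattern returns None, which is where publisher preconfiguration or a question to the author could step in.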
Further useable information
Allowed context from the DTD
Paragraph standard format
Text patterns
» Bullets
» Enumeration
» Whitespace
» ASCII Markup (Is *this* a dagger?)
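Text patterns of this kind could be detected with regular expressions, as in the rough sketch below. The concrete expressions, the hint names and the function text_hints are assumptions for illustration; they only produce hints and will yield false positives (for example on initials such as "A. Smith").

```python
import re

# Hypothetical patterns; a real tool would need to tune these.
BULLET = re.compile(r"^\s*[-*•»]\s+")
ENUMERATION = re.compile(r"^\s*(\d+|[a-zA-Z])[.)]\s+")
ASCII_EMPHASIS = re.compile(r"\*(\w[^*]*?)\*")

def text_hints(line):
    """Return the list of text-pattern hints found in one line of text."""
    hints = []
    if BULLET.match(line):
        hints.append("bullet")
    if ENUMERATION.match(line):
        hints.append("enumeration")
    if ASCII_EMPHASIS.search(line):
        hints.append("ascii-emphasis")
    return hints

print(text_hints("Is *this* a dagger?"))
print(text_hints("* first item"))
print(text_hints("1. Introduction"))
```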
Format pattern match confidence
Motivational aspects
Quick feedback on formal correctness
Publication preview while keeping format freedom
(Via XSL) flexible previews of other formats
New structure-based functionality:
» Structure editing
» Structure evaluation
» Document templates
Conclusion
Summary
» 4-step inference: record format tuples, locate the elements, reduce tuples to patterns, determine types
» Increase efficiency of publication chain
» Provide unobtrusive structuring for non-expert authors
Plans
» Cautious extension of inference
» Validation of document
» Evaluation with authors