Rheinisch-Westfälische Technische Hochschule Aachen, Department of Computer Science III, Prof. Dr.-Ing. M. Nagl. September 14, 2000 / Digital Documents and Electronic Publishing 2000. Inferring Structure Information from Typography. Christian Fuß, Dipl.-Inform. Felix Gatzemeier, Michael Kirchhof, Dipl.-Inform. Oliver Meyer. Department of Computer Science III, RWTH Aachen

Inferring Structure Information from Typography


Page 1: Inferring Structure Information from Typography

Rheinisch-Westfälische Technische Hochschule Aachen
Department of Computer Science III, Prof. Dr.-Ing. M. Nagl

September 14, 2000 / Digital Documents and Electronic Publishing 2000

Inferring Structure Information from Typography

Christian Fuß

Dipl.-Inform. Felix Gatzemeier

Michael Kirchhof

Dipl.-Inform. Oliver Meyer

Department of Computer Science III, RWTH Aachen

Page 2: Inferring Structure Information from Typography

Department of Computer Science III

RWTH Aachen

DDEP 2000

Inferring Structure Information from Typography

2

Overview

Context

Deriving Structure Information:
» Partitioning
» Typographic abstraction
» Determine Type

Conclusion

Cooperation project of

Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)

Page 3: Inferring Structure Information from Typography

Today’s Publication Chain

[Figure: publication chain diagram with the labels Author, Writing, Proprietary document format, Copy Editing, Typesetting, Publisher, Conversion, Web Publ., Standard format, Reading, Reader.]

Page 4: Inferring Structure Information from Typography

Classification of Submissions

[Figure: classification diagram of submissions with the labels TEX, MS Word, Unformatted, Formatted, Somehow Formatted, Correctly Formatted, Structured (XML).]

Page 5: Inferring Structure Information from Typography

Basic Assumptions

Textual Nature

Typographic markup

Consistent markup

Known target document type

Page 6: Inferring Structure Information from Typography

Deriving Structure Information

In: MS Word document

Record Formatting (Format Tuples)

Locate the Elements

Reduce Format Tuples to Patterns

Determine Types

Out: XML document

Also interactively
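
The final step, writing the XML document, might look like the following minimal sketch. It assumes the earlier steps have already produced a flat list of (element type, text) pairs for one paragraph. The function name `runs_to_xml`, the `para` wrapper element, and the convention that `Body` text is left unwrapped are illustrative assumptions, not details of the prototype.

```python
import xml.etree.ElementTree as ET


def runs_to_xml(runs, paragraph_tag="para", plain_type="Body"):
    """Serialize the typed text runs of one paragraph as an XML string."""
    para = ET.Element(paragraph_tag)
    last = None  # most recently appended inline element, if any
    for element_type, text in runs:
        if element_type == plain_type:
            # Plain text belongs to the paragraph itself: either before
            # the first inline element or as the tail of the last one.
            if last is None:
                para.text = (para.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(para, element_type)
            last.text = text
    return ET.tostring(para, encoding="unicode")


# Hypothetical result of the inference steps for one paragraph.
runs = [("Body", "Is "), ("FirstTerm", "this"), ("Body", " a dagger?")]
print(runs_to_xml(runs))
```

Nested inline elements and validation against the target document type are left out of this sketch.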

Page 7: Inferring Structure Information from Typography

Format Tuples

The basic typographic abstraction

FormatTuple("Is this a dagger?") = [Times, 22pt, regular, roman]

Here: Font, Size, Weight, Variation

Planned: Search expressions modulo Text

More general: Including regular expressions of text content or context.
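
As a rough illustration of the format tuple abstraction, the sketch below models the four components named on this slide as a Python `NamedTuple`. The dictionary layout of a run and the helper name `format_tuple` are assumptions made for the example; the slides do not show how the prototype reads formatting from MS Word.

```python
from typing import NamedTuple


class FormatTuple(NamedTuple):
    font: str
    size: str
    weight: str
    variation: str


def format_tuple(run: dict) -> FormatTuple:
    """Abstract a formatted run to its format tuple, dropping the text."""
    return FormatTuple(run["font"], run["size"], run["weight"], run["variation"])


# Hypothetical representation of one formatted run.
run = {"text": "Is this a dagger?", "font": "Times", "size": "22pt",
       "weight": "regular", "variation": "roman"}
other = dict(run, text="Some other text")

print(format_tuple(run))
# Runs with different text but equal formatting share one format tuple.
print(format_tuple(run) == format_tuple(other))
```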

Page 8: Inferring Structure Information from Typography

Locate the Elements

Tree-Partitioning of Formatted Character Streams on
» Format Tuple changes
» Paragraph breaks

Nesting of Inline Elements
» Is this a dagger? <ft1>
» Is this a dagger? <ft1 <ft2> ft1>
» Is this a dagger? <ft1 <ft2> >
» Is this a dagger? <ft1 <ft2 <ft3> > >

Format-To-Type Map: FormatTuple → ElementType

ft1 (times, 22pt, reg, roman) dummyType1
ft2 (times, 22pt, bold, roman) dummyType2
ft3 (times, 22pt, reg, italic) dummyType3
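
The partitioning on format tuple changes and paragraph breaks can be sketched as follows. This is a simplified, flat version: it splits a stream of (character, format tuple) pairs into runs per paragraph and assigns dummy types in order of first appearance, but it does not build the nested tree shown above. The function names and the use of a newline as the paragraph break are assumptions for the example.

```python
def partition(chars):
    """Split (character, format_tuple) pairs into paragraphs of runs."""
    paragraphs = [[]]
    for ch, ft in chars:
        if ch == "\n":
            paragraphs.append([])  # a paragraph break starts a new paragraph
            continue
        runs = paragraphs[-1]
        if runs and runs[-1][0] == ft:
            runs[-1] = (ft, runs[-1][1] + ch)  # same format: extend the run
        else:
            runs.append((ft, ch))  # format tuple change: start a new run
    return [p for p in paragraphs if p]


def dummy_type_map(paragraphs):
    """Assign a dummy element type to each distinct format tuple."""
    mapping = {}
    for runs in paragraphs:
        for ft, _ in runs:
            if ft not in mapping:
                mapping[ft] = f"dummyType{len(mapping) + 1}"
    return mapping


FT1 = ("times", "22pt", "reg", "roman")
FT2 = ("times", "22pt", "bold", "roman")
stream = ([(c, FT1) for c in "Is "] + [(c, FT2) for c in "this"]
          + [(c, FT1) for c in " a dagger?"])

paragraphs = partition(stream)
print(paragraphs)
print(dummy_type_map(paragraphs))
```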

Page 9: Inferring Structure Information from Typography

Format patterns

Identity too restrictive → wildcard generalization

Is this a dagger? (,,)

Times     Times   Times     *
22pt      22pt    22pt      *
regular   bold    regular   bold
roman     roman   roman     *

(, a, b) = (a, a, b); (a, b, ) = (a, b, b)

(, a, ) propagated to paragraph level

Format-To-Type Map: FormatPattern → ElementType

fp1 (*, *, regular, *) dummyType1
fp2 (*, *, bold, *) dummyType2
fp2b (*, *, bold, roman) dummyType2
fp3 (*, *, regular, italic) dummyType3
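
One possible reading of the wildcard generalization is sketched below: a component of a format tuple is kept only where it differs from the same component of a neighbouring run, and everything else becomes `*`. Treating a missing neighbour as a copy of the current tuple is an interpretation of the rules on this slide, whose symbols were lost, so it should be taken as an assumption. The names `generalize` and `WILDCARD` are hypothetical.

```python
WILDCARD = "*"


def generalize(left, current, right):
    """Reduce a format tuple to a pattern relative to its neighbours."""
    # Assumption: a missing neighbour is replaced by the current tuple.
    left = current if left is None else left
    right = current if right is None else right
    return tuple(
        cur if (cur != l or cur != r) else WILDCARD
        for l, cur, r in zip(left, current, right)
    )


regular = ("Times", "22pt", "regular", "roman")
bold = ("Times", "22pt", "bold", "roman")
italic = ("Times", "22pt", "regular", "italic")

print(generalize(regular, bold, regular))  # bold run between regular runs
print(generalize(None, regular, bold))     # first run of a paragraph
print(generalize(bold, italic, bold))      # italic run between bold runs
```

Under this reading, a run with no neighbours at all reduces to wildcards only. The sketch does not implement the propagation to paragraph level that the slide mentions for that case.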

Page 10: Inferring Structure Information from Typography

Determine Types

Replace dummy types in Format-To-Type Map

Preconfiguration by publisher

Controlled Learning from the author

FormatPattern → ElementType

(*, *, regular, *) Body
(*, *, bold, *) FirstTerm
(*, *, bold, roman) FirstTerm
(*, *, regular, italic) Emphasis
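
A lookup in such a Format-To-Type Map could be sketched like this. The table entries are taken from the slide; the matching function and the rule of preferring the pattern with the fewest wildcards are assumptions, since the slides do not say how competing patterns are resolved.

```python
FORMAT_TO_TYPE = [
    (("*", "*", "regular", "*"), "Body"),
    (("*", "*", "bold", "*"), "FirstTerm"),
    (("*", "*", "bold", "roman"), "FirstTerm"),
    (("*", "*", "regular", "italic"), "Emphasis"),
]


def matches(pattern, ft):
    """A pattern matches if every component is a wildcard or equal."""
    return all(p == "*" or p == v for p, v in zip(pattern, ft))


def element_type(ft, table=FORMAT_TO_TYPE):
    """Return the element type for a format tuple, or None if unknown."""
    candidates = [(pattern, etype) for pattern, etype in table
                  if matches(pattern, ft)]
    if not candidates:
        return None  # no pattern applies; the author could be asked here
    # Assumed tie-break: the most specific pattern (fewest wildcards) wins.
    _, etype = min(candidates, key=lambda c: c[0].count("*"))
    return etype


print(element_type(("Times", "22pt", "regular", "roman")))
print(element_type(("Times", "22pt", "regular", "italic")))
print(element_type(("Times", "22pt", "bold", "roman")))
```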

Page 11: Inferring Structure Information from Typography

Further useable information

Allowed context from the DTD

Paragraph standard format

Text patterns
» Bullets
» Enumeration
» Whitespace
» ASCII Markup (Is *this* a dagger?)

Format pattern match confidence
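
Text patterns of the kind listed above could be detected with regular expressions, for instance as in this sketch. The concrete expressions, the hint labels, and the function name `text_hints` are illustrative assumptions; the slides do not specify which patterns the prototype would recognise.

```python
import re

# Assumed conventions: "-", "*" or a bullet character opens a bullet item;
# a number or single letter followed by "." or ")" opens an enumeration item.
BULLET = re.compile(r"^\s*[-*•]\s+")
ENUMERATION = re.compile(r"^\s*(\d+|[a-zA-Z])[.)]\s+")
# ASCII markup: text enclosed in asterisks without inner padding.
ASCII_EMPHASIS = re.compile(r"\*(\S(?:.*?\S)?)\*")


def text_hints(paragraph):
    """Collect structure hints from the plain text of a paragraph."""
    hints = []
    if BULLET.match(paragraph):
        hints.append("bullet")
    elif ENUMERATION.match(paragraph):
        hints.append("enumeration")
    if ASCII_EMPHASIS.search(paragraph):
        hints.append("ascii-emphasis")
    return hints


print(text_hints("Is *this* a dagger?"))
print(text_hints("• Bullets"))
print(text_hints("1. Enumeration"))
```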

Page 12: Inferring Structure Information from Typography

Motivational aspects

Quick feedback on formal correctness

Publication preview while keeping format freedom

(Via XSL) flexible previews of other formats

New structure-based functionality:
» Structure editing
» Structure evaluation
» Document templates

Page 13: Inferring Structure Information from Typography

Conclusion

Summary
» 4-step inference
    Record format tuples
    Locate the elements
    Reduce tuples to patterns
    Determine types
» Increase efficiency of publication chain
» Provide unobtrusive structuring for non-expert authors

Plans
» Cautious extension of inference
» Validation of document
» Evaluation with authors