TeX2Star A System for Converting TeX to OpenOffice By Jeffrey Starr

TeX2Star

A System for Converting TeX to OpenOffice
By Jeffrey Starr

Overview

● Why does conversion matter?
● Why has it not already been done?
  – Why is it difficult?
● Proposal: TeX->OpenOffice
● Proposal: TeX->DVI->OpenOffice
● Solution
● Unsolved problems

What is OpenOffice?

● Open Source office suite
● Based on StarOffice, currently owned by Sun Microsystems
● Cross-Platform
● XML based, standards driven
● Semantic-based format

What is TeX?

● Written by Donald E. Knuth
● Solution to declining standards in mathematical typography
● Heavily used in mathematics and physics
● Both a program and a programming language
● Presentation-based format

Why Bother to Convert?

● TeX rare outside mathematical circles
● Conflicts with publishing software
● Does not fit within current word processing model
● TeX's purpose is to produce journal-quality typography, not to facilitate editing of content.

Aside: Editable Output

● TeX has many presentation outputs:
  – DVI
  – PostScript
  – PDF
  – PNG
  – TIFF
  – Fax

● TeX has no direct editable outputs.

Solution: TeX->OpenOffice

● Why use the outputs? Read the original document.

● Perfect knowledge of content and (presentational) intent

● Write a program that reads TeX and outputs OpenOffice, instead of DVI

Problems with TeX->OpenOffice

● TeX is a large system
  – Eight years development
  – Too large for a semester
● Irregular
● Non-Balanced
● Many special cases

TeX is Irregular

● An irregular language is one in which typical rules of processing are violated

● Irregular '\atop': (TeX)
  – {numerator \atop denominator}
● Regular '\frac': (LaTeX)
  – \frac{numerator}{denominator}
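
The difference can be sketched in a few lines of Python. This is only an illustration, not the parser used by TeX2Star: the token lists and the helper names (read_braced, read_frac, split_atop) are made up for the example, and split_atop assumes the group contains no nested groups. With prefix \frac the reader knows what to collect as soon as it sees the command; with infix \atop it has already consumed the numerator and must go back and reinterpret it.

```python
def read_braced(tokens, pos):
    """Return the tokens inside the brace group starting at pos, and the next position."""
    assert tokens[pos] == "{"
    depth, start, pos = 1, pos + 1, pos + 1
    while depth:
        if tokens[pos] == "{":
            depth += 1
        elif tokens[pos] == "}":
            depth -= 1
        pos += 1
    return tokens[start:pos - 1], pos


def read_frac(tokens, pos):
    """Prefix form: the command announces that two arguments follow."""
    assert tokens[pos] == r"\frac"
    numerator, pos = read_braced(tokens, pos + 1)
    denominator, pos = read_braced(tokens, pos)
    return (numerator, denominator), pos


def split_atop(group_tokens):
    """Infix form: only after the whole group is read can it be split in two.

    Simplification: assumes no nested groups inside group_tokens.
    """
    if r"\atop" not in group_tokens:
        return None
    i = group_tokens.index(r"\atop")
    return group_tokens[:i], group_tokens[i + 1:]


frac_result, _ = read_frac([r"\frac", "{", "n", "}", "{", "d", "}"], 0)
atop_result = split_atop(["n", r"\atop", "d"])
```

Both calls end with the same (numerator, denominator) pair, but only read_frac can decide what to do at the moment the command is seen.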

TeX is not balanced

● A language that is balanced will have an explicit beginning and end to each grouping

● Non-balanced font commands: (TeX)
  – \bf this is bold \rm this is normal, roman text
● Balanced font commands: (LaTeX)
  – \textbf{this is bold} this is back to normal

TeX has many special cases

● \par may either:
  – explicitly end a paragraph
  – do nothing (if in math mode)
  – do nothing (if in restricted horizontal mode)
  – tell TeX to build the current page
● \par is also irregular (acts on material already processed and in the reverse direction) and unbalanced (may or may not be preceded by \indent, a primitive to start a paragraph)

Solution: TeX->DVI->OpenOffice

● Let TeX deal with TeX
● Run TeX on the original text
● Read the resultant DVI output
● Process the DVI output to OpenOffice

Problem: Lack of semantic data

● DVI contains font definitions, text stream, and description of black boxes

● Fonts contain characters, but do not say what those characters are
  – Especially a problem with kerning: “ﬀ” (one glyph) vs. “ff” (two letters)
  – Also a problem with bold and italic text: bold and italics are their own fonts
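
One way to picture the missing information is as a lookup table that the converter has to supply itself. The sketch below is an assumption for illustration, not part of TeX2Star: the function name decode and both tables are hypothetical, the ligature slots 11 to 15 are those of the classic Computer Modern text fonts as I recall them, and only ASCII letters and digits are handled.

```python
# Hypothetical lookup tables; a real converter would need one per font encoding.
# Slots 11..15 hold the f-ligatures in the classic Computer Modern text fonts.
LIGATURES = {11: "ff", 12: "fi", 13: "fl", 14: "ffi", 15: "ffl"}

# Bold and italic are separate fonts, so the style must be inferred from the name.
FONT_STYLES = {
    "cmr10": frozenset(),
    "cmbx10": frozenset({"bold"}),
    "cmti10": frozenset({"italic"}),
}


def decode(font, code):
    """Map a (font name, character code) pair to plain text plus a style set."""
    if code in LIGATURES:
        text = LIGATURES[code]
    elif chr(code).isascii() and chr(code).isalnum():
        text = chr(code)  # letters and digits sit at their ASCII positions
    else:
        text = None  # unknown slot: this sketch does not guess
    return text, FONT_STYLES.get(font, frozenset())
```

For example, decode("cmbx10", 11) yields the two letters "ff" together with the style set {"bold"}, information that the DVI file itself never states.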

Solution: Add Annotations

● Use interpositioning and the TeX primitive '\special' to send extra information to the DVI file

● \special leaves comments that can be read later
● Reading the DVI with proper annotation allows the text to retain some level of semantic information

● Difference between knowing that the next character is smaller and raised versus knowing that the next character is a superscript
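
A minimal sketch of the reading side, under stated assumptions: the event tuples, the "tex2star:" prefix, and the "begin"/"end" annotation syntax are all invented for this example and are not the format TeX2Star actually writes. The point is only that the vertical moves and font changes are ignored, while the annotations left by \special carry the meaning.

```python
PREFIX = "tex2star:"  # hypothetical prefix marking our own annotations


def interpret(events):
    """Turn a simplified stream of DVI-like events into tagged text."""
    out = []
    for kind, value in events:
        if kind == "char":
            out.append(value)
        elif kind == "special" and value.startswith(PREFIX):
            note = value[len(PREFIX):]
            if note.startswith("begin "):
                out.append("<" + note[len("begin "):] + ">")
            elif note.startswith("end "):
                out.append("</" + note[len("end "):] + ">")
        # Moves, font changes, and other programs' specials carry no semantics here.
    return "".join(out)


events = [
    ("char", "x"),
    ("special", "tex2star:begin sup"),
    ("down", -3),  # the raise is visible in the DVI, but it is only presentation
    ("char", "2"),
    ("down", 3),
    ("special", "tex2star:end sup"),
]
tagged = interpret(events)
```

Without the two specials the reader would see only "x", a raise, "2", and a drop; with them it can emit x<sup>2</sup>.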

Problem: Unbalanced Tags

● Some primitives are balanced, but many are not
● Tags may affect the document for an arbitrary length of time, or may be local to a paragraph or specific block of text

Solution: Balancing

● Algorithm:
  – Given: database of tags
    ● start tag, end tag, 'insert end tag' tags
  – Go through list of tags, find one that needs help balancing
  – Go forward along list, finding nearest tag that closes the previous tag, or end of document
  – Insert end of tag into the list of tags
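
The steps above can be sketched as follows. This is one possible reading of the algorithm, not the TeX2Star source: the tag names, the contents of TAG_DB, and the representation of the document as a list of ("tag", name) and ("text", string) pairs are assumptions made for the example.

```python
# Hypothetical tag database: start tag -> (end tag, 'insert end tag' tags).
TAG_DB = {
    "bf": ("/bf", {"rm", "it", "par"}),
    "it": ("/it", {"rm", "bf", "par"}),
}


def tag_name(item):
    """Return the tag name for a ("tag", name) item, or None for text."""
    return item[1] if item[0] == "tag" else None


def balance(items):
    """Insert a missing end tag before the nearest closing tag, or at the end."""
    result = list(items)
    i = 0
    while i < len(result):
        entry = TAG_DB.get(tag_name(result[i]))
        if entry is not None:
            end_tag, closers = entry
            j = i + 1
            # Go forward to the explicit end tag, an implicit closer, or the end.
            while j < len(result) and tag_name(result[j]) != end_tag \
                    and tag_name(result[j]) not in closers:
                j += 1
            if j == len(result) or tag_name(result[j]) != end_tag:
                result.insert(j, ("tag", end_tag))
        i += 1
    return result


unbalanced = [("tag", "bf"), ("text", "this is bold"),
              ("tag", "rm"), ("text", "this is normal, roman text")]
balanced = balance(unbalanced)
```

Here the \rm switch acts as an 'insert end tag' tag for \bf, so an explicit end of bold is placed just before it. A run that is already closed is left alone, and a run that reaches the end of the document is closed there.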

Post Document Editing

● Further balancing and insertion of tags may be necessary after first sweep through file

● Tables:
  – OpenOffice format requires number of columns to be specified
  – We don't know how many columns will be needed until after we read the entire table
  – Solution: After processing, go back and insert the needed information
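
A rough sketch of that second pass, assuming the rows have already been collected as lists of cell strings. The function name table_xml is hypothetical, the element names follow the OpenOffice table markup as I recall it, and namespace declarations and styles are left out, so the output is a fragment rather than a valid document.

```python
from xml.sax.saxutils import escape


def table_xml(rows):
    """Emit a table fragment once every row is known, so the column count can be filled in."""
    columns = max((len(row) for row in rows), default=0)
    lines = ["<table:table>"]
    # The column declaration comes first in the output but is known only at the end.
    lines.append('  <table:table-column table:number-columns-repeated="%d"/>' % columns)
    for row in rows:
        lines.append("  <table:table-row>")
        for cell in row:
            lines.append("    <table:table-cell><text:p>%s</text:p></table:table-cell>"
                         % escape(cell))
        lines.append("  </table:table-row>")
    lines.append("</table:table>")
    return "\n".join(lines)


fragment = table_xml([["a", "b"], ["c", "d", "e & f"]])
```

The widest row decides the declared column count, which is why the declaration cannot be written during the first sweep through the file.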

Unsolved Problems

● Footnotes:
  – Defined by position in the page
  – Automatic positioning conflicts with paragraph detection tool
  – Unable to distinguish among footnotes, extra paragraphs, headers, and footers

● Non-English alphabets

Conclusion

● Semantics of document are lost in TeX itself, so no hope of recovery

● Overt presentation can be recovered for editing
● Method works to translate an irregular, non-well-formed language into a regular, well-formed language (XML)