Transforming PDF into HTML

Matt Kuznicki, CTO

Agenda

1. Challenges of PDF conversion

2. Making convertible PDF from the start

3. About all those other PDFs out there…

4. Features of Datalogics PDF Alchemist

5. Summary and concluding thoughts

A bit about me

CTO at Datalogics

Worked with PDF for over 15 years

Board member of PDF Association

Active participant in the PDF standards community

Challenges of PDF conversion

PDF was designed to convey an exact visual representation of information to humans

PDF’s origins did not account for storing and retrieving machine-understandable information

PDF is page- and position-based and lacks the notion of text flow and grouping*

Many different PDFs in the wild – some easy to interpret, some very complex

PDF designed to convey exact visual representation

Reliable visual representation, but many potential ways to make something that looks a certain way

The capability to tie semantic information to content came to PDF later on

Use is increasing but still far from the majority of content being produced

Most PDF generators still prefer smaller files to PDF files that are easier to repurpose

PDF designed for human consumption

At the time PDF was conceived as a PostScript replacement, reliable rendering for human readers was an important issue…

Focus was on retrieving the information needed to display and print pages for people's use

Affordances for machine “reading” were bolt-ons to the format

The community has made great strides in allowing for machine interpretation, but proper use requires expertise in the domain

Structure and semantics are optional – usage is still rare

PDF is page and position based

This is NOT a PDF-specific issue

Like a TIFF or raster image, marks on a PDF page are precisely positioned and usually come in small discrete pieces

Humans automatically see a page flow that is not always present in the PDF syntax

Contents of a PDF page can be specified in an order very different from how we read

Words, images, and other elements on a page may have the marks that constitute them spread far apart in the page marking stream
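The gap between content order and reading order can be illustrated with a minimal Python sketch. It assumes a single-column page and a hypothetical `Fragment` record holding each piece of placed text with its position (in PDF user space, y increases upward). The names and the tolerance value are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (larger y is higher on the page)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then order lines top to bottom."""
    lines = []  # each entry: [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within each line, read left to right.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments in the order they might appear in a page content stream,
# which here has nothing to do with reading order.
page = [
    Fragment("world", 60.0, 700.0),
    Fragment("line", 62.0, 686.0),
    Fragment("Hello", 20.0, 700.5),
    Fragment("Second", 20.0, 686.0),
]

print(reading_order(page))
```

Real pages with multiple columns, rotated text, or sidebars need considerably more logic than this; the sketch only shows why position, not stream order, has to drive the reconstruction.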

Creating Tagged PDF means you embed the information for repurposing and reflow directly into the PDF when it’s created – at the right time!

Easy to convert Tagged PDF into other formats

But not all Tagged PDF is the same, and not all generators emit useful Tagged PDF!

Avoid all this trouble at the start – if you can!
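To show why Tagged PDF converts easily, here is a minimal Python sketch that walks a structure tree and emits HTML. The tree is a hypothetical in-memory stand-in, built from nested `(tag, children)` tuples, for the structure tree a PDF library would expose. The tag names are standard PDF structure types, but the mapping is a simplified assumption and covers only a few of them.

```python
import html

# A few standard PDF structure types mapped to HTML elements (simplified).
TAG_MAP = {
    "Document": "article",
    "H1": "h1",
    "H2": "h2",
    "P": "p",
    "Table": "table",
    "TR": "tr",
    "TH": "th",
    "TD": "td",
}


def to_html(node):
    """node is either a string (text content) or a (tag, children) tuple."""
    if isinstance(node, str):
        return html.escape(node)
    tag, children = node
    element = TAG_MAP.get(tag, "div")  # unknown types fall back to a generic block
    inner = "".join(to_html(child) for child in children)
    return f"<{element}>{inner}</{element}>"


tree = ("Document", [
    ("H1", ["Challenges"]),
    ("P", ["PDF is page & position based."]),
])

print(to_html(tree))
```

Because the structure and reading order are already recorded in the file, the conversion is little more than a tree walk. The hard part of untagged PDF is that this tree has to be inferred.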

But how about all those other PDFs out there?

Existing PDFs aren’t going to magically gain structure semantics

Existing tools and workflows may not be upgradable in the near future – or at all

Not all files converted to PDF contain enough information for structure semantics in the first place

Is OCR the only way to handle these? No!

OCR is not always reliable in converting pictures of text back into actual text flows

Rasterizing PDFs and then converting the resulting images back into non-raster form introduces multiple chances for errors and unexpected results

Conversion of PDF to HTML relies upon:

Seeing pages in a way like a human reads them

Figuring out the logical structure of the pages

Putting text back together into text flows

Putting all these elements out in the correct order
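One of these steps, putting text back together into text flows, can be sketched in Python as follows. The sketch assumes the lines are already in reading order and carry their baseline positions, and it uses a simple heuristic (a vertical gap noticeably larger than the typical line spacing starts a new paragraph). The function name and the `gap_factor` threshold are illustrative assumptions, not a description of any product's method.

```python
import html
from statistics import median


def lines_to_html(lines, gap_factor=1.5):
    """lines: list of (baseline_y, text), top to bottom, y decreasing."""
    if not lines:
        return ""
    # Vertical distance between each pair of consecutive baselines.
    gaps = [a[0] - b[0] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0.0
    paragraphs = [[lines[0][1]]]
    for (prev_y, _), (y, text) in zip(lines, lines[1:]):
        if typical and (prev_y - y) > gap_factor * typical:
            paragraphs.append([text])      # large gap: start a new paragraph
        else:
            paragraphs[-1].append(text)    # normal spacing: same text flow
    return "\n".join(
        "<p>" + html.escape(" ".join(p)) + "</p>" for p in paragraphs
    )


sample = [
    (700.0, "PDF stores"),
    (688.0, "placed words."),
    (676.0, "More text"),
    (650.0, "A new paragraph"),
    (638.0, "starts here."),
]

print(lines_to_html(sample))
```

The resulting paragraphs reflow freely in a browser, which is the point of recovering text flows rather than reproducing fixed line breaks.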

PDF Alchemist

Datalogics PDF to HTML conversion technology

What does PDF Alchemist offer?

Works on untagged PDFs – handles existing PDFs, does not require workflow changes or regenerating/reconstructing source PDFs

Turns placed words in PDFs back into text flows – reflowable text

Re-creates tables and lists from page content

Removes pagination artifacts such as page numbers and running headers

Converts PDF into single-page HTML5 + CSS or into EPUB packages

Converts PDF forms into fixed-layout HTML forms for use in mobile environments
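As an illustration of what removing pagination artifacts involves, the following Python sketch drops first and last lines that repeat across most pages, treating digit runs as equal so that changing page numbers still match. This is a generic heuristic under stated assumptions (each page is a list of text lines in reading order; `min_share` is an arbitrary threshold) and is not a description of how PDF Alchemist works internally.

```python
import re
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """pages: list of pages, each a list of text lines in reading order."""

    def key(line):
        # Replace digit runs so "Page 3" and "Page 4" compare equal.
        return re.sub(r"\d+", "#", line.strip())

    # Count how many pages carry each normalized first or last line.
    counts = Counter()
    for page in pages:
        counts.update({key(line) for line in page[:1] + page[-1:]})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        body = list(page)
        if body and key(body[0]) in repeated:
            body = body[1:]    # drop running header
        if body and key(body[-1]) in repeated:
            body = body[:-1]   # drop page number or running footer
        cleaned.append(body)
    return cleaned


doc = [
    ["Annual Report", "Intro text", "Page 1"],
    ["Annual Report", "More text", "Page 2"],
    ["Annual Report", "Final text", "Page 3"],
]

print(strip_running_lines(doc))
```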

Demonstrations

Conversion of a PDF file into an HTML file

Conversion of a PDF form into an HTML form

Using PDF Alchemist

• Available as a command line tool for server and workflow integration

• Or as a simple “C” API for integration into programs

Summary

Most PDFs are and will continue to be made without regard to repurposing

Reconstructing the content and flow of PDF relies upon advanced logic and mimicry of how humans read pages

PDF Alchemist offers this logic in an easy-to-use software package

Any Questions?

Matt Kuznicki

Chief Technical Officer

Datalogics Inc.

www.datalogics.com

Twitter: @DatalogicsInc

https://www.linkedin.com/in/mattkuznicki
