Transforming PDF into HTML
Matt Kuznicki, CTO
Agenda
1. Challenges of PDF conversion
2. Making convertible PDF from the start
3. About all those other PDFs out there…
4. Features of Datalogics PDF Alchemist
5. Summary and concluding thoughts
A bit about me
CTO at Datalogics
Worked with PDF for over 15 years
Board member of PDF Association
Active participant in the PDF standards community
Challenges of PDF conversion
PDF was designed to convey an exact visual representation of information to humans
PDF’s origins did not account for storing and retrieving machine-understandable information
PDF is page and position based, lacks the notion of text flow and grouping*
Many different PDFs in the wild – some easy to interpret, some very complex
PDF designed to convey exact visual representation
Reliable visual representation, but many potential ways to make something that looks a certain way
The capability to tie semantic information to content came to PDF later on
Use is increasing, but is still far from the majority of content being produced
Most PDF generators still prefer smaller files to PDF files that are easier to repurpose
PDF designed for human consumption
At the time PDF was conceived as a PostScript replacement, reliable rendering for human readers was an important issue…
Focus was on retrieving the information needed to display and print pages for people’s use
Affordances for machine “reading” were bolt-ons to the format
Community has made great strides in allowing for machine interpretation, but proper use requires expertise in the domain
Structure and semantics are optional – usage is still rare
PDF is page and position based
This is NOT a PDF-specific issue
Like a TIFF or raster image, marks on a PDF page are precisely positioned and usually come in small discrete pieces
Humans automatically see a page flow that is not always present in the PDF syntax
Contents of a PDF page can be specified in an order very different from how we read
Words, images, and other elements on a page may have the marks that constitute them spread far apart in the page marking stream
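To make the gap between stream order and reading order concrete, here is a minimal sketch in Python. It assumes a single-column page and text fragments that have already been extracted with their positions; the `Fragment` type, the `reading_order` function, and the tolerance value are illustrative names, not part of any PDF library or of PDF Alchemist.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF user space: y grows upward)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read top-to-bottom, left-to-right.

    Assumption: single-column layout. Multi-column pages need extra logic.
    """
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments in a plausible stream order that differs from reading order.
stream = [
    Fragment("world", 60, 700),
    Fragment("Second", 10, 686),
    Fragment("Hello", 10, 700.5),
    Fragment("line", 60, 686),
]
print(reading_order(stream))  # ['Hello world', 'Second line']
```

Even this toy case needs a tolerance for baselines that differ slightly, which hints at why real pages with columns, tables, and sidebars are much harder.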
Creating Tagged PDF means you embed the information for repurposing and reflow directly into the PDF when it’s created – at the right time!
Easy to convert Tagged PDF into other formats
But not all Tagged PDF is the same, and not all generators emit useful Tagged PDF!
Avoid all this trouble at the start – if you can!
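As a rough illustration of how a workflow might check whether a file claims to be Tagged PDF: the PDF specification marks tagged documents with a `/MarkInfo` dictionary whose `/Marked` entry is true, alongside a `/StructTreeRoot` entry in the document catalog. The sketch below assumes the catalog has already been read into a plain Python dict by some PDF library; `looks_tagged` is a hypothetical helper name.

```python
def looks_tagged(catalog):
    """Return True if a document catalog claims to be Tagged PDF.

    Assumption: `catalog` is a plain dict standing in for the PDF document
    catalog, using the key names from the PDF specification.
    """
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog


print(looks_tagged({"/MarkInfo": {"/Marked": True}, "/StructTreeRoot": {}}))  # True
print(looks_tagged({"/Pages": {}}))  # False
```

A positive result only says the file declares itself tagged. It says nothing about whether the tags are useful, which is exactly the caveat above.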
But how about all those other PDFs out there?
Existing PDFs aren’t going to magically gain structure semantics
Existing tools and workflows may not be upgradable in the near future – or at all
Not all files converted to PDF contain enough information for structure semantics in the first place
Is OCR the only way to handle these? No!
OCR is not always reliable in converting pictures of text back into actual text flows
Rasterizing PDFs to scan and turn back into non-raster form introduces multiple chances for errors and unexpected results
Conversion of PDF to HTML relies upon:
Seeing pages the way a human reads them
Figuring out the logical structure of the pages
Putting text back together into text flows
Putting all these elements out in the correct order
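The last two steps can be sketched in a few lines of Python. This is a simplified heuristic, not the logic PDF Alchemist uses: it assumes lines are already in reading order and treats a vertical gap larger than a multiple of the font size as a paragraph break. The `lines_to_html` name and the `gap_factor` parameter are illustrative.

```python
from html import escape


def lines_to_html(lines, gap_factor=1.5):
    """Merge positioned lines into paragraphs and emit HTML.

    `lines` is a list of (baseline_y, font_size, text) tuples in reading
    order, top to bottom. The gap threshold is an assumed heuristic.
    """
    paragraphs, current, prev_y = [], [], None
    for y, size, text in lines:
        # A gap well above normal line spacing suggests a paragraph break.
        if prev_y is not None and (prev_y - y) > gap_factor * size:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_y = y
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


page = [
    (700, 10, "PDF stores placed words,"),
    (688, 10, "not paragraphs."),
    (660, 10, "HTML needs flows & structure."),
]
print(lines_to_html(page))
```

The output is two `<p>` elements, with the ampersand escaped, so the text can reflow in a browser instead of staying pinned to page coordinates.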
PDF Alchemist
Datalogics PDF to HTML conversion technology
What does PDF Alchemist offer?
Works on untagged PDFs – handles existing PDFs, does not require workflow changes or regenerating/reconstructing source PDFs
Turns placed words in PDFs back into text flows – reflowable text
Re-creates tables and lists from page content
Removes pagination artifacts such as page #s and running headers
Converts PDF into single-page HTML5 + CSS or into EPUB packages
Converts PDF forms into fixed-layout HTML forms for use in mobile environments
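To give a feel for what removing pagination artifacts involves, here is one simple heuristic, sketched in Python. It is an assumption for illustration only and does not describe PDF Alchemist's actual method: lines at the top or bottom of a page that repeat on most pages, once digits are ignored, are treated as running headers or page numbers. The `strip_pagination` name and `min_share` threshold are hypothetical.

```python
import re
from collections import Counter


def normalize(line):
    """Replace digit runs so that 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_pagination(pages, min_share=0.6):
    """Drop first/last lines that repeat on at least `min_share` of pages.

    `pages` is a list of pages, each a list of text lines in reading order.
    """
    counts = Counter()
    for page in pages:
        if not page:
            continue
        # Only the first and last line of each page are candidates.
        for key in {normalize(page[0]), normalize(page[-1])}:
            counts[key] += 1
    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [
        [
            line
            for i, line in enumerate(page)
            if not (i in (0, len(page) - 1) and normalize(line) in repeated)
        ]
        for page in pages
    ]


pages = [
    ["Annual Report", "Revenue grew in 2015.", "Page 1"],
    ["Annual Report", "Costs fell.", "Page 2"],
    ["Annual Report", "Outlook is stable.", "Page 3"],
]
print(strip_pagination(pages))
```

Body lines that contain digits, such as the year in the first page, survive because only the first and last line of each page are considered.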
Demonstrations
Conversion of a PDF file into an HTML file
Conversion of a PDF form into an HTML form
Using PDF Alchemist
• Available as a command line tool for server and workflow integration
• Or as a simple “C” API for integration into programs
Summary
Most PDFs are and will continue to be made without regard to repurposing
Reconstructing the content and flow of PDF relies upon advanced logic and mimicry of how humans read pages
PDF Alchemist offers this logic in an easy-to-use software package
Any Questions?
Matt Kuznicki
Chief Technical [email protected]: mattkuznicki
Datalogics Inc.
www.datalogics.com
Twitter: @DatalogicsInc