Intelligently Extracting Data from PDFs Presented by Matt Kuznicki Chief Technical Officer, Datalogics

Intelligent Content Extraction from PDFs


Intelligently Extracting Data from PDFs

Presented by Matt Kuznicki, Chief Technical Officer, Datalogics

Agenda

• Technical Challenges in PDF Data Extraction
• Key Considerations for Data Extraction
• Use Cases
• About Datalogics PDF Alchemist

About Me

• Chief Technical Officer at Datalogics
• Vice Chairman of PDF Association Board of Directors
• Worked extensively with PDF for over 15 years
• Active participant in the PDF standards community

Technical Challenges in PDF Data Extraction

Extraction: Technical Challenges

• PDF is a page description language – elements typically have fixed position on a physical plane

• Elements are not necessarily defined in order of appearance

• Richer vocabulary for expressing elements than other formats

• Structure and semantics of elements not commonly stated

PDF as Page Description Language

At the time PDF was conceived in the 1990s, reliable rendering for human readers was an important issue:
• Focus was on retrieving the information needed to display and print pages for people’s use
• Affordances to give content semantics came much later
• Community has made great strides in allowing for machine interpretation, but proper use requires expertise in the domain
• Structure and semantics are optional – usage is still rare
• This is NOT a PDF-specific issue

PDF as Page Description Language

• PDF format is most concerned with expressing exact visual representation
• Elements are placed at fixed positions on virtual pages, in small discrete pieces
• Not as fine-grained as individual dots in a raster file, but not as continuous as the content in most HTML
• No guarantee of sentences, or even of letters grouped together to form whole words, in a PDF data stream
• Usually PDF files contain no information about how elements relate to each other
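To illustrate the point about letters not being grouped into words, here is a minimal Python sketch of one common heuristic: sort the glyphs on a baseline by position and start a new word wherever the horizontal gap is wide. The `Glyph` record, the `group_into_words` function and the `gap_ratio` threshold are hypothetical names chosen for this example; real extractors use font metrics and more careful rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One placed character on a single baseline (hypothetical record)."""
    char: str
    x: float      # left edge of the glyph
    width: float  # horizontal advance of the glyph


def group_into_words(glyphs, gap_ratio=0.3):
    """Join glyphs into words, splitting where the gap to the previous
    glyph exceeds gap_ratio times the previous glyph's width."""
    words, current, prev = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Glyphs listed in stream order, which differs from left-to-right order.
glyphs = [
    Glyph("D", 30.0, 7.0), Glyph("F", 37.0, 6.0), Glyph("P", 23.0, 7.0),
    Glyph("f", 48.0, 3.0), Glyph("i", 51.0, 3.0),
    Glyph("l", 54.0, 3.0), Glyph("e", 57.0, 5.0),
]
print(group_into_words(glyphs))
```

The threshold is a judgment call: too small and kerned letters split apart, too large and tightly set words run together.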

PDF as Page Description Language

PDF pages often contain content that is a byproduct of breaking data into page-size chunks, such as:
• Page numbers
• Page headers and footers
• Guides and information for printing

These elements are not usually considered real document data; extracting them as content is usually undesired.
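One simple way to spot such pagination artifacts, sketched below in Python, is to look at the first and last line of every page and drop those that repeat on most pages once digits are masked out. The function name `strip_pagination_artifacts` and the `min_share` threshold are assumptions made for this illustration, and the input is assumed to be already-extracted text lines per page.

```python
import re
from collections import Counter


def strip_pagination_artifacts(pages, min_share=0.6):
    """pages: list of pages, each a list of text lines, top to bottom.
    Removes first/last lines whose digit-masked form recurs on at
    least min_share of the pages."""

    def key(line):
        # Mask digits so "Page 1 of 3" and "Page 2 of 3" compare equal.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for lines in pages:
        counts.update({key(line) for line in lines[:1] + lines[-1:]})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([
            line for i, line in enumerate(lines)
            if not (i in (0, last) and key(line) in repeated)
        ])
    return cleaned


pages = [
    ["Annual Report", "Revenue grew in 2019.", "Page 1 of 3"],
    ["Annual Report", "Costs were flat.", "Page 2 of 3"],
    ["Annual Report", "Outlook is stable.", "Page 3 of 3"],
]
print(strip_pagination_artifacts(pages))
```

Only the page edges are examined, so a body sentence that happens to contain a number is left alone.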

Elements and Ordering

Small graphic elements can mean big extraction problems:
• Contents of a PDF page can be specified in an order very different from how we read
• Humans automatically see a page flow that is not always present in the PDF data stream
• Words, images and other elements on a page may have the marks that constitute them spread far throughout the page marking stream
• Without ordering information, flow of PDF content must be heuristically derived and is subject to differing interpretations
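As a rough illustration of deriving flow heuristically, the Python sketch below clusters text fragments into lines by baseline and then orders each line left to right. It assumes a single-column page and PDF-style coordinates where larger y is higher on the page; `Fragment`, `reading_order` and `line_tolerance` are hypothetical names for this example, and other heuristics could reasonably produce a different order.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A run of text placed on the page (hypothetical record)."""
    text: str
    x: float  # left edge
    y: float  # baseline; larger y is higher on the page


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, top to bottom, then
    join each line's fragments from left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments in stream order, which does not match reading order.
fragments = [
    Fragment("order.", 130.0, 700.5),
    Fragment("Humans", 72.0, 720.0),
    Fragment("reading", 72.0, 700.0),
    Fragment("infer", 120.0, 720.4),
]
print(reading_order(fragments))
```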

Richer Vocabulary For Elements

PDF includes a richer way to express elements than most other languages:
• Images can be in many different forms, including GIF, JPEG, PNG, JPEG 2000 and JBIG2 derived formats
• Fonts can be in several forms, including OpenType, TrueType, Type 1, CFF, multiple master; or expressed in PDF element syntax
• Text may be expressed in a way that includes Unicode information – or in one of hundreds of encodings – but no Unicode information is actually required
• Rich transparency and blending model allows for complex element interaction
• Content may be optionally present or absent from a page depending on a number of different triggers and conditions
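The text encoding point can be made concrete with a small Python sketch. A PDF font may carry an optional ToUnicode mapping from character codes to Unicode; when a code has no entry, an extractor has to guess or flag the gap. The dictionary below is a hypothetical stand-in for such a mapping, and `decode_codes` is an illustrative helper, not part of any library.

```python
def decode_codes(codes, to_unicode):
    """Map raw character codes to text, marking codes that have no
    Unicode entry with U+FFFD (the replacement character)."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# A subset font might number only the glyphs it uses: 1, 2, 3, ...
subset_map = {1: "P", 2: "D", 3: "F"}

# Code 4 is drawn on the page but has no Unicode information.
shown = bytes([1, 2, 3, 4])
print(decode_codes(shown, subset_map))
```

The page renders correctly either way, which is why missing Unicode information often goes unnoticed until extraction is attempted.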

Structure and Semantics

Information on the structure and semantics of a PDF page is usually not present:
• Lists are really just bunches of words and sometimes symbols humans interpret as bullets or delimiters
• Tables are really just a series of lines and shaded boxes, and bunches of words, that humans interpret together as rows, columns and headers
• Paragraphs are really just bunches of words positioned on a page in such a way that humans interpret them as sentences grouped together
• Columns don’t exist in the PDF data stream; we humans simply see elements grouped in a way that suggests columns
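Since columns are only implied by placement, an extractor has to infer them. A minimal Python sketch of one approach follows: merge the horizontal extents of text lines and treat any wide empty vertical band as a column boundary. The `find_columns` name and the `min_gap` value are assumptions for this example, and a full-width heading would merge the columns in this simplified version.

```python
def find_columns(boxes, min_gap=12.0):
    """boxes: (x_left, x_right) extents of text lines on one page.
    Returns merged column extents, split wherever the horizontal
    gap between extents is at least min_gap."""
    columns = []
    for left, right in sorted(boxes):
        if columns and left - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], right)
        else:
            columns.append([left, right])
    return [tuple(c) for c in columns]


# Line extents from a two-column page, in arbitrary stream order.
boxes = [(72, 290), (320, 540), (72, 280), (320, 530), (72, 295)]
print(find_columns(boxes))
```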

Structure and Semantics

When creating PDFs, it is possible to include structure and semantics in the PDF:
• Creating tagged PDF means the information for conversion is included directly in the PDF when it’s created – at the right time!
• Easy to convert tagged PDF into other formats and to reflow
• Not all tagged PDF is of good quality – and not all generators emit useful tagged PDF!

Bottom line: you can’t count on getting PDF that has easily extractable content!

Key Considerations For Data Extraction

Extraction: Key Considerations

• Content extraction means different things to different audiences

• Know your audience and its goals

• Different goals are best met through different means

Extraction: Different Meanings

Let’s take a PDF that’s just one image of a scanned page:
• Does extracting the content mean returning the image?
• Does extracting the content mean OCRing the image and returning the text?

If the PDF is an image with text underneath – is the content the image, the text, or both?

Know Your Audience’s Goals

Different audiences have different needs:
• Extraction for indexing or summarization typically requires a pure text stream of paragraphs
• Extraction for loading contents into a database for machine learning typically does not need appearance preservation
• Extraction for presentation on a different screen or medium typically means content order should be preserved but the appearance is expected to change

Different Goals, Different Means

Different goals mean different trade-offs:
• Indexing, machine learning, data mining – preservation of text and reconstruction of semantics most important
• Reformatting for reflow or format conversion – balance between text preservation and appearance preservation needed
• Reformatting for reliable viewing across devices – appearance preservation most important, text preservation secondary
  • Semantic reconstruction usually not required

Use Cases

Use Cases for Content Extraction

• Conversion to HTML for viewing PDF without a PDF viewer
• Converting PDF into a reflowable HTML representation
• Extraction of PDF contents for machine understanding

Viewing PDF Without a PDF Viewer

PDF extraction and conversion revolves around visual appearance:
• Extract content into a 1-to-1 analogue in a different fixed layout (HTML + SVG, raster image, print-out, etc.)
• Convert extracted content into different visual primitives
• Reliable viewing, but maintains disadvantages of PDF format

This is the simplest and easiest way to convert PDF content for human reading – but it doesn’t extract the content into a useful form for machines

Converting PDF Into Reflowable HTML

PDF extraction and conversion balances needs of humans and machine understanding:
• Elements are analyzed in page context and turned back into text flows, lists, tables, and other structured elements
• Elements that can’t be expressed in HTML are usually rendered to allow proper viewing, at the loss of searchability
• Navigation elements – bookmarks, links – are converted into HTML equivalents for easy browsing
• Pagination artifacts are discarded when possible

Resulting HTML is reflowable and gives good document reading experience, but appearance typically changes somewhat to be more “HTML-ish”

Extraction of PDF Contents For Machine Understanding

PDF extraction focused on text and structure:
• Elements are analyzed in page context and turned back into text flows, lists, tables, and other structured elements
• Text elements that can’t be expressed in HTML are usually left as text, sacrificing visual fidelity
• Navigation elements – bookmarks, links – are converted so that automated processes can crawl these
• Pagination artifacts should be discarded when possible

About Datalogics PDF Alchemist

Datalogics PDF Alchemist

• Works on untagged PDFs – handles existing PDFs, does not require workflow changes or regenerating/reconstructing source PDFs
• Turns placed words in PDFs back into reflowable text
• Re-creates tables and lists from page content
• Removes pagination artifacts such as page numbers and running headers
• Converts PDF into single-page HTML5 + CSS or into EPUB packages
• Converts PDF forms into fixed-layout HTML forms for use in mobile environments

Summary

Extracting Content from PDFs

Intelligently extracting content from PDF files requires:
• Seeing pages the way a human reads them
• Figuring out the logical structure of the pages
• Putting text back together into text flows
• Putting all these elements back together in the correct order
• Compensating intelligently for differences between PDF and the chosen method of receiving content

Questions?

Matt Kuznicki
Chief Technical Officer
[email protected]: mattkuznicki

Datalogics Inc.
www.datalogics.com

Twitter: @DatalogicsInc