Extracting text from PDF (iOS)

Extracting text from PDF

How far does the rabbit hole go?

Kaz Yoshikawa

May 2016


How do you extract text from a PDF on iOS?

🤔

I know some say "extracting text from PDF is really hard"

Surely that's an exaggeration, isn't it?

References

• Text extraction from Asian-language PDFs (アジア言語圏のPDFのテキスト抽出)
  http://ponpoko1968.hatenablog.com/entry/20100810/1281438828
  http://ponpoko1968.hatenablog.com/entry/20100915/1284559500

• How to build a PDF viewer (series), HMDT (PDFビューワの作り方)
  https://news.mynavi.jp/itsearch/article/devsoft/1212

• PDF: A Thousand and One Nights, Antenna House (PDF千夜一夜)
  http://www.antenna.co.jp/pdf/reference/Blog-Index.htm

• PDFKitten
  https://github.com/KurtCode/PDFKitten

What is hard? Really?

Why so difficult?

• iOS does not provide any API to extract text directly
  (OS X has PDFKit, which is still limited)

• Core Graphics provides only a very basic API

• You need to write a parser, which is hard. Really!

• The extracted text data is not Unicode

• Glyph IDs have to be mapped to Unicode

Understanding PDF Structure

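Document - Page

[Diagram: Document contains Outline, Pages, and Metadata; Pages contains the individual Page objects]

Page - Font

[Diagram: Page contains MediaBox, Resources, and Contents; Resources contains Font, whose entries (Tc1, Tc2, …) each have a Subtype]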

Case: Type 1

Subtype          Type1
Name             Referenced from the Font subdictionary
BaseFont         PostScript font name
FirstChar        First character code defined in the font's Widths array
LastChar         Last character code defined in the font's Widths array
Widths           An array of (LastChar − FirstChar + 1) widths
FontDescriptor   A font descriptor describing the font's metrics other than its glyph widths
Encoding         Font's character encoding
ToUnicode        CMap file that maps character codes to Unicode values

PDF Reference: p412
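
The relationship between FirstChar, LastChar, and Widths can be sketched in a few lines. This is only an illustration in Python, not the Core Graphics API: the font dictionary is made-up sample data and `glyph_width` is a hypothetical helper.

```python
def glyph_width(font, code, missing_width=0):
    """Return the advance width for a character code in a simple font."""
    first, last = font["FirstChar"], font["LastChar"]
    widths = font["Widths"]
    # The Widths array must hold exactly LastChar - FirstChar + 1 entries.
    if len(widths) != last - first + 1:
        raise ValueError("Widths length does not match FirstChar/LastChar")
    if first <= code <= last:
        return widths[code - first]
    return missing_width


# Made-up sample font covering character codes 32 through 35.
sample_font = {"FirstChar": 32, "LastChar": 35, "Widths": [278, 278, 355, 556]}
print(glyph_width(sample_font, 34))  # prints 355
```

Codes outside the FirstChar..LastChar range fall back to `missing_width`, which is an assumption of this sketch.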

Case: TrueType

Subtype          TrueType

Same as Type 1, with some differences

PDF Reference: p412

Case: Type 3

Subtype              Type3
FontBBox             A rectangle expressed in the glyph coordinate system
FontMatrix           An array of six numbers specifying the font matrix, mapping glyph space to text space
CharProcs            ??
FirstChar, LastChar  ditto
Widths               ditto (sort of)
FontDescriptor       A font descriptor describing the font's default metrics other than its glyph widths
Resources            A list of the named resources, such as fonts and images

PDF Reference: p420

Case: Type 0 Composite Fonts

Subtype          CIDFontType0 or CIDFontType2
BaseFont         The PostScript name of the CIDFont
CIDSystemInfo    A dictionary containing entries that define the character collection of the CIDFont
FontDescriptor   A font descriptor describing the CIDFont's default metrics other than its glyph widths
DW               The default width for glyphs in the CIDFont. Default value: 1000
DW2              An array of two numbers specifying the default metrics for vertical writing
W2               A description of the metrics for vertical writing for the glyphs in the CIDFont
CIDToGIDMap      Type 2 CIDFonts only (omitted)

PDF Reference: p436

😏

OK, PDF structure is pretty complex. Are there any tools?

Tools

PDF-Voyeur (open source)

https://github.com/below/PDF-Voyeur

[Screenshot: PDF-Voyeur showing a Page with its Font, Contents (text, etc.), BoundingBox, Rotation, and Annotation entries]

Understanding how PDFs are rendered

The Page object knows everything needed to draw the page

[Diagram: Page contains MediaBox, Resources, and Contents; Resources contains Font, which contains Tc2. Legend: dictionary, array, stream]

Drawing operators

Operators

BT                    Begin a text object
/F1 24 Tf             Specify the font
175 720 Td            Specify the location
(Hello World!) Tj     Draw the text
ET                    End a text object
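
To make the operand-then-operator order concrete, here is a rough scanner sketch in Python. It is not how CGPDFScanner is implemented; it assumes a tiny subset of the content stream syntax (no nested parentheses, no arrays, no TJ), and `shown_strings` is a hypothetical name.

```python
import re

TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"   # literal string without nested parentheses
    rb"|<[0-9A-Fa-f\s]*>"      # hex string
    rb"|/[^\s/()<>\[\]]*"      # name such as /F1
    rb"|[^\s/()<>\[\]]+"       # number or operator
)


def shown_strings(content):
    """Yield (font_name, raw_operand) for every Tj operator in a content stream."""
    operands, font = [], None
    for match in TOKEN.finditer(content):
        token = match.group()
        if token[:1] in b"(</" or re.fullmatch(rb"[-+]?\d*\.?\d+", token):
            operands.append(token)   # operand: keep it until an operator arrives
            continue
        if token == b"Tf" and len(operands) >= 2:
            font = operands[-2].decode("ascii")   # Tf operands are: name size
        elif token == b"Tj" and operands:
            yield font, operands[-1]
        operands = []                # every operator consumes its operands


stream = b"BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET"
print(list(shown_strings(stream)))  # prints [('/F1', b'(Hello World!)')]
```

The font name matters because the bytes handed to Tj can only be decoded with the font that was selected by the preceding Tf.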

Rendering Japanese

/C2_0 1 Tf
0 Tc
175 720 Td
<30533093306B3061306F> Tj

Tf, Td, Tj: PDF Reference p398, 406, 407
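
The hex string operand can be turned into bytes with a few lines of Python. This is a sketch: `decode_hex_string` is a hypothetical helper, and decoding the result as UTF-16BE is an assumption that happens to hold for this particular string. In general the codes have to go through the font's CMap first.

```python
def decode_hex_string(operand):
    """Turn a PDF hex string operand such as <3053> into raw bytes."""
    digits = "".join(operand.strip("<>").split())
    if len(digits) % 2:      # a missing final digit is treated as 0
        digits += "0"
    return bytes.fromhex(digits)


raw = decode_hex_string("<30533093306B3061306F>")
# Assumption: the codes are big-endian 16-bit Unicode values.
print(raw.decode("utf-16-be"))  # prints こんにちは
```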

Decoding Text

Case 1: the font has a 'ToUnicode' entry

Font entry

Subtype          Type1

Parsing CMap

CMap Specification

Adobe CMap and CIDFont Files Specification

Version 1.0

11 June 1993

Adobe Developer Support

PN LPS5014


https://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf

102 pages

CMap example

%!PS-Adobe-3.0 Resource-CMap
%%Version: 1
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin      ← Adobe Japan 1-0
  /Registry (Adobe) def
  /Ordering (Japan1) def
  /Supplement 0 def
end def
/CMapName /83pv-RKSJ-H def
/CMapVersion 1 def
/CMapType 0 def
/UIDOffset 0 def
/XUID [1 10 25324] def
/WMode 0 def                         ← Horizontal/Vertical
4 begincodespacerange
  <00> <80>
  <8140> <9ffc>
  <a0> <df>
  <e040> <fbfc>
endcodespacerange
1 beginnotdefrange
<00> <1f> 1
endnotdefrange
100 begincidrange                    ← CID Range
<9780> <97fc> 3914
<9840> <9872> 4039
<989f> <98fc> 4090
<9940> <997e> 4184
<9980> <99fc> 4247
<< 90 ranges missing >>
<ed83> <ed83> 7934
<ed84> <ed84> 992
<ed85> <ed85> 7935
<ed86> <ed86> 994
<ed87> <ed87> 7936
endcidrange
17 begincidrange                     ← CID Range
<ed88> <ed8d> 996
<ed8e> <ed8e> 7937
<< 13 ranges missing >>
<ee9a> <ee9a> 768
<ee9b> <ee9c> 7631
endcidrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
%%EndResource
%%EOF
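
The codespace ranges tell a parser how many bytes each character code occupies. A possible way to split a byte string with the four ranges from this CMap is sketched below in Python. `split_codes` is a hypothetical helper, and skipping a single byte when nothing matches is a simplification of this sketch.

```python
CODESPACE = [  # (low, high) pairs from the begincodespacerange section
    (bytes.fromhex("00"), bytes.fromhex("80")),
    (bytes.fromhex("8140"), bytes.fromhex("9ffc")),
    (bytes.fromhex("a0"), bytes.fromhex("df")),
    (bytes.fromhex("e040"), bytes.fromhex("fbfc")),
]


def split_codes(data, codespace=CODESPACE):
    """Split a byte string into character codes using codespace ranges."""
    codes, pos = [], 0
    while pos < len(data):
        for low, high in codespace:
            chunk = data[pos:pos + len(low)]
            # Every byte must fall inside the range given for its position.
            if len(chunk) == len(low) and all(
                lo <= b <= hi for lo, b, hi in zip(low, chunk, high)
            ):
                codes.append(int.from_bytes(chunk, "big"))
                pos += len(chunk)
                break
        else:
            codes.append(data[pos])   # no range matched: take one byte
            pos += 1
    return codes


print([hex(c) for c in split_codes(bytes.fromhex("419780b1ed84"))])
# prints ['0x41', '0x9780', '0xb1', '0xed84']
```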

begincidrange - endcidrange

100 begincidrange
<9780> <97fc> 3914
<9840> <9872> 4039
<989f> <98fc> 4090
<9940> <997e> 4184
<9980> <99fc> 4247
…
<ed83> <ed83> 7934
<ed84> <ed84> 992
<ed85> <ed85> 7935
<ed86> <ed86> 994
<ed87> <ed87> 7936
endcidrange

• Codes in the range 0x9780 to 0x97fc

• are mapped to 3914 through 4038

• Unicode code point: UCS-2

• 16-bit
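
The range arithmetic can be checked with a short Python sketch. Only three of the ranges from the listing are copied in, and `code_to_cid` is a hypothetical helper name.

```python
CID_RANGES = [  # (first code, last code, first CID), copied from the listing
    (0x9780, 0x97FC, 3914),
    (0x9840, 0x9872, 4039),
    (0xED84, 0xED84, 992),
]


def code_to_cid(code, ranges=CID_RANGES):
    """Map a character code to a CID, or None when no range covers it."""
    for first, last, cid in ranges:
        if first <= code <= last:
            return cid + (code - first)
    return None


print(code_to_cid(0x9780), code_to_cid(0x97FC), code_to_cid(0xED84))
# prints 3914 4038 992
```

The last code of the first range lands on 4038, one below the 4039 where the next range starts.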

Some others

• beginbfchar - endbfchar

• beginbfrange - endbfrange

• begincidchar - endcidchar

• begincidrange - endcidrange

• begincodespacerange - endcodespacerange
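
In a ToUnicode CMap, the bfchar and bfrange sections are the ones that carry the Unicode values. The Python sketch below builds a lookup table from them. The sample CMap text is invented for illustration, `parse_to_unicode` is a hypothetical name, and the sketch assumes each bfrange starts at a single 16-bit code point and ignores the array form of bfrange.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Build {character code: Unicode string} from bfchar and bfrange sections."""
    mapping = {}
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        # Only the <first> <last> <start> form is handled here.
        for first, last, start in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            base = int(start, 16)
            for offset in range(int(last, 16) - int(first, 16) + 1):
                mapping[int(first, 16) + offset] = chr(base + offset)
    return mapping


# Invented sample data, not taken from a real PDF.
sample = """
2 beginbfchar
<0001> <3053>
<0002> <3093>
endbfchar
1 beginbfrange
<0010> <0012> <0041>
endbfrange
"""
table = parse_to_unicode(sample)
print("".join(table[c] for c in (0x0001, 0x0002, 0x0010, 0x0012)))  # prints こんAC
```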

Case 2: Encoding is Identity-H or Identity-V, no 'ToUnicode' entry

Using an external CMap

• Check CIDSystemInfo

• Registry, Ordering, Supplement (e.g. Adobe Japan 1-6)

• Adobe Type Tools
  https://github.com/adobe-type-tools/cmap-resources
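
The three CIDSystemInfo values combine into the name of the character collection, which is also how the matching CMap resource is named. A minimal Python sketch follows; `cmap_resource_name` is a hypothetical helper and the dictionary stands in for values read from the PDF.

```python
def cmap_resource_name(system_info):
    """Build a name such as 'Adobe-Japan1-6' from a CIDSystemInfo dictionary."""
    return "{Registry}-{Ordering}-{Supplement}".format(**system_info)


info = {"Registry": "Adobe", "Ordering": "Japan1", "Supplement": 6}
print(cmap_resource_name(info))  # prints Adobe-Japan1-6
```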

Adobe Japan 1-6

%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (Japan1) def
  /Supplement 6 def
end def
/CMapName /Adobe-Japan1-6 def
/CMapVersion 1.005 def
/CMapType 1 def
/XUID [1 10 25614] def
/WMode 0 def
/CIDCount 23058 def
1 begincodespacerange
  <0000> <5AFF>
endcodespacerange
91 begincidrange
<0000> <00ff> 0
<0100> <01ff> 256
<0200> <02ff> 512
<0300> <03ff> 768
<0400> <04ff> 1024
<0500> <05ff> 1280
<0600> <06ff> 1536
<0700> <07ff> 1792
<0800> <08ff> 2048
<0900> <09ff> 2304
…
<5300> <53ff> 21248
<5400> <54ff> 21504
<5500> <55ff> 21760
<5600> <56ff> 22016
<5700> <57ff> 22272
<5800> <58ff> 22528
<5900> <59ff> 22784
<5a00> <5a11> 23040
endcidrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

https://github.com/adobe-type-tools/cmap-resources/blob/master/cmapresources_japan1-6/CMap/Adobe-Japan1-6

Be careful: the character codes may not be Unicode.

Case 3: no 'ToUnicode' entry, Encoding is "WinAnsiEncoding" etc.

Use the following encodings

WinAnsiEncoding      NSWindowsCP1252StringEncoding
MacRomanEncoding     …
MacExpertEncoding    …
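
The same idea can be tried outside of iOS with Python's built-in codecs. This is a sketch under assumptions: the codec names `cp1252` and `mac_roman` are Python's, chosen here as close stand-ins for the PDF encodings rather than exact equivalents, and MacExpertEncoding is left out because Python has no codec for it.

```python
PDF_TO_PYTHON_CODEC = {
    "WinAnsiEncoding": "cp1252",      # Windows code page 1252
    "MacRomanEncoding": "mac_roman",  # classic Mac OS Roman
}


def decode_simple(raw, encoding_name):
    """Decode bytes from a simple font using its named PDF encoding."""
    codec = PDF_TO_PYTHON_CODEC.get(encoding_name)
    if codec is None:
        raise ValueError("no codec known for " + encoding_name)
    # errors="replace" turns unassigned byte values into U+FFFD.
    return raw.decode(codec, errors="replace")


print(decode_simple(b"caf\xe9", "WinAnsiEncoding"))   # prints café
print(decode_simple(b"caf\x8e", "MacRomanEncoding"))  # prints café
```

The two calls show why the encoding name matters: the same letter é is byte 0xE9 in one encoding and 0x8E in the other.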

Enough talk… Let's code

Find the 1st page

[Diagram: Document contains Outline, Pages, and Metadata; Pages contains the individual Page objects]

CGPDFOperatorTable ← Callback

Some Tips

CGPDFDictionaryApplyFunction

• CGPDFDictionaryApplyFunction()

• C-style callback

• Not possible in Swift 1.x (probably)

• Possible in Swift 2

• Enumerates each entry in a CGPDFDictionary

Utility function

DEMO

Wrap up

• Understanding PDF structure

• Too many encodings: hard to find test data

• Too complex: the documentation is not always clear

• Yeah, parsing PDF is hard, really…

Thank You

Kaz Yoshikawa