
Page 1: Extracting text from PDF (iOS)

Extracting text from PDF

How far does the rabbit hole go?

Kaz Yoshikawa

May 2016

Page 2: Extracting text from PDF (iOS)

How to extract text from PDF on iOS?

Page 3: Extracting text from PDF (iOS)

🤔

I know some people say "extracting text from PDF is really hard"

Surely that's an exaggeration, isn't it?

Page 4: Extracting text from PDF (iOS)

References

Page 5: Extracting text from PDF (iOS)

References

• Extracting text from PDFs in Asian languages (アジア言語圏のPDFのテキスト抽出)
  http://ponpoko1968.hatenablog.com/entry/20100810/1281438828
  http://ponpoko1968.hatenablog.com/entry/20100915/1284559500

• How to build a PDF viewer (series), HMDT
  https://news.mynavi.jp/itsearch/article/devsoft/1212

• PDF 1001 Nights (PDF千夜一夜), Antenna House
  http://www.antenna.co.jp/pdf/reference/Blog-Index.htm

Page 6: Extracting text from PDF (iOS)

References

• PDFKitten

https://github.com/KurtCode/PDFKitten

Page 7: Extracting text from PDF (iOS)

What is hard? Really?

Page 8: Extracting text from PDF (iOS)

Why so difficult?

• iOS does not provide any API to extract text directly (OS X has PDFKit, but it is still limited)

• Core Graphics provides only a very basic API

• You need to write a parser, which is hard! Really!

• Extracted text data is not Unicode

• It needs a glyph ID to Unicode mapping

Page 9: Extracting text from PDF (iOS)

Understanding PDF Structure

Page 10: Extracting text from PDF (iOS)

Document - Page

[Diagram: a Document contains Outline, Pages, and Metadata; Pages contains the individual Page objects]

Page 11: Extracting text from PDF (iOS)

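Page - Font

[Diagram: a Page contains MediaBox, Resources, and Contents; Resources contains a Font dictionary, among other entries; the Font dictionary holds named fonts such as Tc1 and Tc2, each with a subtype and further entries]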

Page 12: Extracting text from PDF (iOS)

case: Type 1

Subtype Type1

Name Referenced from Font subdictionary

BaseFont PostScript font name

FirstChar First character code defined in the font’s Widths array

LastChar Last character code defined in the font’s Widths array

Widths An array of (LastChar − FirstChar + 1) widths

FontDescriptor A font descriptor describing the font’s metrics other than its glyph widths

Encoding Font’s character encoding

ToUnicode CMap file that maps character codes to Unicode values

PDF Reference: p412
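
The relationship between FirstChar, LastChar, and Widths can be sketched in a few lines. This is only an illustration: the type name `SimpleFontWidths`, the `missingWidth` parameter, and the sample numbers are assumptions made for the example, not values from any real font.

```swift
import Foundation

/// Hypothetical holder for the three width-related entries of a simple font.
struct SimpleFontWidths {
    let firstChar: Int
    let lastChar: Int
    let widths: [Double]   // expected count: lastChar - firstChar + 1

    /// Width for a character code; codes outside FirstChar...LastChar
    /// fall back to `missingWidth` (assumed here to default to 0).
    func width(forCode code: Int, missingWidth: Double = 0) -> Double {
        guard code >= firstChar, code <= lastChar,
              widths.indices.contains(code - firstChar) else {
            return missingWidth
        }
        return widths[code - firstChar]
    }
}

// Made-up sample data: three widths for codes 32, 33, and 34.
let sampleWidths = SimpleFontWidths(firstChar: 32, lastChar: 34, widths: [278, 333, 474])
print(sampleWidths.width(forCode: 33))
```

The index into Widths is simply the character code minus FirstChar, which is why the array must hold exactly (LastChar − FirstChar + 1) entries.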

Page 13: Extracting text from PDF (iOS)

case: TrueType

Subtype TrueType

Name Referenced from Font subdictionary

BaseFont PostScript font name

FirstChar First character code defined in the font’s Widths array

LastChar Last character code defined in the font’s Widths array

Widths An array of (LastChar − FirstChar + 1) widths

FontDescriptor A font descriptor describing the font’s metrics other than its glyph widths

Encoding Font’s character encoding

ToUnicode CMap file that maps character codes to Unicode values

PDF Reference: p412

Same as Type1 with some differences

Page 14: Extracting text from PDF (iOS)

case: Type 3

Subtype Type3

Name Referenced from Font subdictionary

FontBBox A rectangle expressed in the glyph coordinate system

FontMatrix An array of six numbers specifying the font matrix, mapping glyph space to text space

CharProcs A dictionary mapping each glyph name to a content stream that draws the glyph

FirstChar, LastChar ditto

Widths ditto – sort of

FontDescriptor A font descriptor describing the font’s default metrics other than its glyph widths

Resources A list of the named resources, such as fonts and images

ToUnicode CMap file that maps character codes to Unicode values

PDF Reference: p420

Page 15: Extracting text from PDF (iOS)

Case: Type 0 Composite Fonts

Subtype CIDFontType0 or CIDFontType2

Name Referenced from Font subdictionary

BaseFont The PostScript name of the CIDFont

CIDSystemInfo A dictionary containing entries that define the character collection of the CIDFont

FontDescriptor A font descriptor describing the CIDFont’s default metrics other than its glyph widths

DW The default width for glyphs in the CIDFont. Default value: 1000

DW2 An array of two numbers specifying the default metrics for vertical writing

W2 A description of the metrics for vertical writing for the glyphs in the CIDFont

CIDToGIDMap Type 2 CIDFonts only — omitted

PDF Reference: p436

Page 16: Extracting text from PDF (iOS)

😏

OK, PDF structure is pretty complex. Are there any tools?

Page 17: Extracting text from PDF (iOS)

Tools

Page 18: Extracting text from PDF (iOS)

PDF-Voyeur

Open Source

https://github.com/below/PDF-Voyeur

Page 19: Extracting text from PDF (iOS)

[Screenshot: PDF-Voyeur inspecting a Page, with callouts for Font, Contents (text, etc.), BoundingBox, Rotation, and Annotation]

Page 20: Extracting text from PDF (iOS)

Understanding how PDFs are rendered

Page 21: Extracting text from PDF (iOS)

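The Page object knows enough to draw the page

[Diagram: a Page contains MediaBox (array), Resources (dictionary), and Contents (stream); Resources leads to Font entries such as Tc2; the Contents stream holds the drawing operators]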

Page 22: Extracting text from PDF (iOS)

Operators

BT                    ←Begin a text object
/F1 24 Tf             ←Specify font
175 720 Td            ←Specify location
(Hello World!) Tj     ←Draw text
ET                    ←End a text object
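
To make the operand-then-operator order concrete, here is a minimal sketch that scans a content stream fragment and collects the literal string shown by each `Tj`. It is a simplification under stated assumptions: the function name `literalStringsShown` is made up for this example, only `(...)` literal strings are handled (no hex strings, no `TJ` arrays), and a backslash simply keeps the following character verbatim instead of interpreting escapes such as `\n`.

```swift
import Foundation

/// Returns the operand of each `Tj` whose operand is a literal string "(...)".
func literalStringsShown(in content: String) -> [String] {
    var results: [String] = []
    var pending: String? = nil   // most recent literal string operand
    var token = ""               // current bare token (operator, number, or name)
    var current = ""             // string being read
    var depth = 0                // parenthesis nesting; > 0 means inside a string
    var escaped = false

    // Called whenever a bare token ends.
    func endToken() {
        if token == "Tj", let text = pending { results.append(text) }
        if !token.isEmpty { pending = nil }   // any token consumes or discards the operand
        token = ""
    }

    for ch in content {
        if depth > 0 {
            if escaped {
                current.append(ch); escaped = false   // simplified escape handling
            } else if ch == "\\" {
                escaped = true
            } else if ch == "(" {
                depth += 1; current.append(ch)        // balanced nested parenthesis
            } else if ch == ")" {
                depth -= 1
                if depth == 0 { pending = current; current = "" } else { current.append(ch) }
            } else {
                current.append(ch)
            }
        } else if ch == "(" {
            endToken(); depth = 1
        } else if ch.isWhitespace {
            endToken()
        } else {
            token.append(ch)
        }
    }
    endToken()
    return results
}

print(literalStringsShown(in: "BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET"))
```

The strings returned here are still raw character codes; turning them into Unicode depends on the font selected by `Tf`.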

Page 23: Extracting text from PDF (iOS)

Rendering Japanese

/C2_0 1 Tf 0 Tc 175 720 Td <30533093306B3061306F> Tj
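
In this particular sample the hex string happens to contain 16-bit Unicode values, so it can be read as UTF-16BE. The sketch below shows that decoding; the helper name `bytes(fromHex:)` is an assumption for this example, and odd-length hex strings are rejected for simplicity. In general the codes in a hex string are not Unicode and must go through the font's mapping first.

```swift
import Foundation

/// Converts the body of a PDF hex string (without the angle brackets) into bytes.
/// Simplification: an odd number of hex digits is treated as an error.
func bytes(fromHex hex: String) -> [UInt8]? {
    let digits = Array(hex.filter { !$0.isWhitespace })
    guard digits.count % 2 == 0 else { return nil }
    var result: [UInt8] = []
    for i in stride(from: 0, to: digits.count, by: 2) {
        guard let byte = UInt8(String(digits[i...i + 1]), radix: 16) else { return nil }
        result.append(byte)
    }
    return result
}

// The operand of Tj from the slide, interpreted as big-endian UTF-16.
if let raw = bytes(fromHex: "30533093306B3061306F"),
   let text = String(bytes: raw, encoding: .utf16BigEndian) {
    print(text)
}
```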

Page 24: Extracting text from PDF (iOS)

Tf, Td, Tj

PDF Reference: p398, 406, 407

Page 25: Extracting text from PDF (iOS)

Decoding Text

Page 26: Extracting text from PDF (iOS)

case 1: Has 'ToUnicode' entry

Page 27: Extracting text from PDF (iOS)

Font entry

Subtype Type1

Name Referenced from Font subdictionary

BaseFont PostScript font name

FirstChar First character code defined in the font’s Widths array

LastChar Last character code defined in the font’s Widths array

Widths An array of (LastChar − FirstChar + 1) widths

FontDescriptor A font descriptor describing the font’s metrics other than its glyph widths

Encoding Font’s character encoding

ToUnicode CMap file that maps character codes to Unicode values

Page 28: Extracting text from PDF (iOS)

Parsing CMap

Page 29: Extracting text from PDF (iOS)

CMap Specification

Adobe CMap and CIDFont Files Specification

Version 1.0

11 June 1993

Adobe Developer Support

PN LPS5014

Adobe Systems Incorporated

https://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf

102 pages

Page 30: Extracting text from PDF (iOS)

CMap example

%!PS-Adobe-3.0 Resource-CMap
%%Version: 1
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (Japan1) def
  /Supplement 0 def                ←Adobe Japan 1-0
end def
/CMapName /83pv-RKSJ-H def
/CMapVersion 1 def
/CMapType 0 def
/UIDOffset 0 def
/XUID [1 10 25324] def
/WMode 0 def                       ←Horizontal/Vertical
4 begincodespacerange
  <00> <80>
  <8140> <9ffc>
  <a0> <df>
  <e040> <fbfc>
endcodespacerange
1 beginnotdefrange
<00> <1f> 1
endnotdefrange
100 begincidrange                  ←CID Range
<9780> <97fc> 3914
<9840> <9872> 4039
<989f> <98fc> 4090
<9940> <997e> 4184
<9980> <99fc> 4247
<< 90 ranges missing >>
<ed83> <ed83> 7934
<ed84> <ed84> 992
<ed85> <ed85> 7935
<ed86> <ed86> 994
<ed87> <ed87> 7936
endcidrange
17 begincidrange                   ←CID Range
<ed88> <ed8d> 996
<ed8e> <ed8e> 7937
<< 13 ranges missing >>
<ee9a> <ee9a> 768
<ee9b> <ee9c> 7631
endcidrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
%%EndResource
%%EOF
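
The codespace ranges in this CMap say which byte sequences form one character code: some codes are one byte long, others two. A rough sketch of how a byte string could be split into codes with those ranges follows. `CodespaceRange` and `splitCodes` are names invented for this illustration, and the handling of bytes that match no range (consume a single byte) is a simplifying assumption.

```swift
import Foundation

/// One entry of a begincodespacerange block, e.g. <8140> <9ffc>.
struct CodespaceRange {
    let low: [UInt8]
    let high: [UInt8]
}

/// Splits a byte string into character codes using codespace ranges.
func splitCodes(_ bytes: [UInt8], using ranges: [CodespaceRange]) -> [[UInt8]] {
    var codes: [[UInt8]] = []
    var index = 0
    while index < bytes.count {
        // A range matches when every byte lies between the range's low and high byte
        // at the same position.
        let match = ranges.first { range in
            let n = range.low.count
            guard index + n <= bytes.count else { return false }
            return (0..<n).allSatisfy {
                range.low[$0] <= bytes[index + $0] && bytes[index + $0] <= range.high[$0]
            }
        }
        let length = match?.low.count ?? 1   // assumption: skip one byte when nothing matches
        codes.append(Array(bytes[index..<index + length]))
        index += length
    }
    return codes
}

// The four codespace ranges from the 83pv-RKSJ-H example.
let rksjRanges = [
    CodespaceRange(low: [0x00], high: [0x80]),
    CodespaceRange(low: [0x81, 0x40], high: [0x9f, 0xfc]),
    CodespaceRange(low: [0xa0], high: [0xdf]),
    CodespaceRange(low: [0xe0, 0x40], high: [0xfb, 0xfc]),
]
print(splitCodes([0x41, 0x97, 0x80, 0xb1], using: rksjRanges))
```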

Page 31: Extracting text from PDF (iOS)

begin-end-cidrange

100 begincidrange
<9780> <97fc> 3914
<9840> <9872> 4039
<989f> <98fc> 4090
<9940> <997e> 4184
<9980> <99fc> 4247
…
<ed83> <ed83> 7934
<ed84> <ed84> 992
<ed85> <ed85> 7935
<ed86> <ed86> 994
<ed87> <ed87> 7936
endcidrange

• Character codes in the range 0x9780 ~ 0x97fc

• are mapped, in order, to 3914 ~ 4038 (0x9780 → 3914, 0x9781 → 3915, …)

• Unicode code point: UCS2

• 16-bit
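
The range arithmetic above can be sketched as follows. The names `CIDRange`, `parseCIDRanges`, and `cid(for:in:)` are made up for this example, and the parser assumes the body contains only well-formed `<low> <high> start` triples.

```swift
import Foundation

/// One line of a begincidrange block: codes low...high map to firstCID, firstCID + 1, ...
struct CIDRange {
    let low: UInt32
    let high: UInt32
    let firstCID: Int
}

/// Parses the lines between begincidrange and endcidrange.
func parseCIDRanges(_ body: String) -> [CIDRange] {
    let tokens = body.split(whereSeparator: { $0.isWhitespace })
    var ranges: [CIDRange] = []
    for i in stride(from: 0, to: tokens.count - 2, by: 3) {
        // Strip the angle brackets from <low> and <high> before parsing the hex digits.
        guard let low = UInt32(tokens[i].dropFirst().dropLast(), radix: 16),
              let high = UInt32(tokens[i + 1].dropFirst().dropLast(), radix: 16),
              let first = Int(tokens[i + 2]) else { continue }
        ranges.append(CIDRange(low: low, high: high, firstCID: first))
    }
    return ranges
}

/// Looks up a character code; returns nil when no range covers it.
func cid(for code: UInt32, in ranges: [CIDRange]) -> Int? {
    for range in ranges where range.low <= code && code <= range.high {
        return range.firstCID + Int(code - range.low)
    }
    return nil
}

let sampleRanges = parseCIDRanges("<9780> <97fc> 3914\n<9840> <9872> 4039")
print(cid(for: 0x97fc, in: sampleRanges) as Any)
```

A linear search is enough for a sketch; with hundreds of ranges a binary search over sorted ranges would be the more natural choice.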

Page 32: Extracting text from PDF (iOS)

Some others

• beginbfchar - endbfchar

• beginbfrange - endbfrange

• begincidchar - endcidchar

• begincidrange - endcidrange

• begincodespacerange - endcodespacerange
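
As an illustration of the `bfrange` form, which is the one typically found in ToUnicode CMaps, here is a small sketch that expands ranges into a code-to-text table. Everything named here is an assumption for the example: `parseBFRanges` is a hypothetical helper, the sample range is invented, and only the simple case is covered, where the destination is a single 16-bit value written as four hex digits (the array form and longer destinations are skipped).

```swift
import Foundation

/// Expands `<low> <high> <dst>` triples from a beginbfrange block into a lookup table.
func parseBFRanges(_ body: String) -> [UInt32: String] {
    // Split on whitespace and strip the surrounding angle brackets.
    let tokens = body.split(whereSeparator: { $0.isWhitespace })
        .map { $0.dropFirst().dropLast() }
    var table: [UInt32: String] = [:]
    for i in stride(from: 0, to: tokens.count - 2, by: 3) {
        guard tokens[i + 2].count == 4,                       // one 16-bit destination only
              let low = UInt32(tokens[i], radix: 16),
              let high = UInt32(tokens[i + 1], radix: 16),
              let start = UInt32(tokens[i + 2], radix: 16),
              low <= high else { continue }
        for code in low...high {
            // Consecutive codes map to consecutive Unicode values.
            if let scalar = Unicode.Scalar(start + (code - low)) {
                table[code] = String(Character(scalar))
            }
        }
    }
    return table
}

// Invented sample: codes 1...3 map to U+3042, U+3043, U+3044.
let sampleTable = parseBFRanges("<0001> <0003> <3042>")
print(sampleTable[0x0001] ?? "?")
```

A `bfchar` entry can be treated as a range whose low and high codes are equal.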

Page 33: Extracting text from PDF (iOS)

case 2: Encoding is Identity-H or Identity-V,

no 'ToUnicode' entry

Page 34: Extracting text from PDF (iOS)

Using external CMap

• Check CIDSystemInfo

• Registry, Ordering, Supplement (e.g. Adobe Japan 1-6)

• Adobe Type Tools https://github.com/adobe-type-tools/cmap-resources

Page 35: Extracting text from PDF (iOS)

Adobe Japan 1-6

%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (Japan1) def
  /Supplement 6 def
end def
/CMapName /Adobe-Japan1-6 def
/CMapVersion 1.005 def
/CMapType 1 def
/XUID [1 10 25614] def
/WMode 0 def
/CIDCount 23058 def
1 begincodespacerange
  <0000> <5AFF>
endcodespacerange
91 begincidrange
<0000> <00ff> 0
<0100> <01ff> 256
<0200> <02ff> 512
<0300> <03ff> 768
<0400> <04ff> 1024
<0500> <05ff> 1280
<0600> <06ff> 1536
<0700> <07ff> 1792
<0800> <08ff> 2048
<0900> <09ff> 2304
…
<5300> <53ff> 21248
<5400> <54ff> 21504
<5500> <55ff> 21760
<5600> <56ff> 22016
<5700> <57ff> 22272
<5800> <58ff> 22528
<5900> <59ff> 22784
<5a00> <5a11> 23040
endcidrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

https://github.com/adobe-type-tools/cmap-resources/blob/master/cmapresources_japan1-6/CMap/Adobe-Japan1-6

Be careful: the character codes may not be Unicode.

Page 36: Extracting text from PDF (iOS)

case 3: No 'ToUnicode' entry,

Encoding is "WinAnsiEncoding" etc.

Page 37: Extracting text from PDF (iOS)

Use the following encodings

WinAnsiEncoding → NSWindowsCP1252StringEncoding

MacRomanEncoding → …

MacExpertEncoding → …
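
For the WinAnsiEncoding row, the Swift spelling of `NSWindowsCP1252StringEncoding` is `String.Encoding.windowsCP1252`. A minimal sketch of decoding the raw bytes of a shown string is below; the helper name `decodeWinAnsi` is an assumption, and treating WinAnsiEncoding as plain code page 1252 is an approximation that may differ for a few codes.

```swift
import Foundation

/// Decodes bytes from a font whose Encoding is WinAnsiEncoding,
/// approximated here as Windows code page 1252.
func decodeWinAnsi(_ bytes: [UInt8]) -> String? {
    return String(bytes: bytes, encoding: .windowsCP1252)
}

// 0xE9 is "é" in CP1252, so these four bytes spell "Café".
print(decodeWinAnsi([0x43, 0x61, 0x66, 0xE9]) ?? "(undecodable)")
```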

Page 38: Extracting text from PDF (iOS)

Enough Talk…Let's code

Page 39: Extracting text from PDF (iOS)

Find the 1st page

[Diagram: a Document contains Outline, Pages, and Metadata; Pages contains the individual Page objects]

Page 40: Extracting text from PDF (iOS)

CGPDFOperatorTable

←Callback
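
One way the operator table and its callback might be wired up with Core Graphics is sketched below. This is not the code from the talk: `ResultBox` and `shownStrings(inFirstPageOf:)` are names made up for this example, only the `Tj` operator is registered, and `CGPDFStringCopyTextString` does not apply the font's encoding, so the strings it returns are only meaningful for simple cases.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics

/// Reference type used to carry results out of the C-style callback.
final class ResultBox { var strings: [String] = [] }

/// Collects the operand of every `Tj` operator on the first page of a PDF.
func shownStrings(inFirstPageOf url: URL) -> [String] {
    guard let document = CGPDFDocument(url as CFURL),
          let page = document.page(at: 1),               // page numbers start at 1
          let table = CGPDFOperatorTableCreate() else { return [] }
    defer { CGPDFOperatorTableRelease(table) }

    // The callback cannot capture Swift variables; state travels through `info`.
    CGPDFOperatorTableSetCallback(table, "Tj") { scanner, info in
        var pdfString: CGPDFStringRef?
        guard let info = info,
              CGPDFScannerPopString(scanner, &pdfString),
              let string = pdfString,
              let text = CGPDFStringCopyTextString(string) else { return }
        let box = Unmanaged<ResultBox>.fromOpaque(info).takeUnretainedValue()
        box.strings.append(text as String)
    }

    let box = ResultBox()
    let stream = CGPDFContentStreamCreateWithPage(page)
    defer { CGPDFContentStreamRelease(stream) }
    let scanner = CGPDFScannerCreate(stream, table, Unmanaged.passUnretained(box).toOpaque())
    defer { CGPDFScannerRelease(scanner) }
    _ = CGPDFScannerScan(scanner)                        // runs the callbacks
    return box.strings
}
#endif
```

A fuller version would also register `Tf` to track the current font and `TJ` for arrays of strings, then decode the raw bytes with that font's mapping.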

Page 41: Extracting text from PDF (iOS)

Some Tips

Page 42: Extracting text from PDF (iOS)

CGPDFDictionaryApplyFunction

• CGPDFDictionaryApplyFunction()

• C-Style callback

• not possible in Swift 1.x (probably)

• possible in Swift 2

• enumerate each entry in CGPDFDictionary
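
A sketch of calling `CGPDFDictionaryApplyFunction` from Swift follows. The helper names `KeyCollector`, `keys(of:)`, and `firstPageKeys(ofPDFAt:)` are assumptions for this example. Because the applier is a C function pointer, the closure cannot capture local variables, so the collector object is passed through the `info` pointer.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics

/// Reference type that receives the keys from the C-style applier.
final class KeyCollector { var keys: [String] = [] }

/// Returns the key names of a CGPDFDictionary, e.g. a page or font dictionary.
func keys(of dictionary: CGPDFDictionaryRef) -> [String] {
    let collector = KeyCollector()
    CGPDFDictionaryApplyFunction(dictionary, { key, _, info in
        guard let info = info else { return }
        let collector = Unmanaged<KeyCollector>.fromOpaque(info).takeUnretainedValue()
        collector.keys.append(String(cString: key))      // key is a C string
    }, Unmanaged.passUnretained(collector).toOpaque())
    return collector.keys
}

/// Convenience wrapper: keys of the first page's dictionary, or [] on failure.
func firstPageKeys(ofPDFAt url: URL) -> [String] {
    guard let page = CGPDFDocument(url as CFURL)?.page(at: 1),
          let dictionary = page.dictionary else { return [] }
    return keys(of: dictionary)
}
#endif
```

For a typical page this would be expected to list entries such as MediaBox, Resources, and Contents, matching the structure shown earlier.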

Page 43: Extracting text from PDF (iOS)

Utility function

Page 44: Extracting text from PDF (iOS)

DEMO

Page 45: Extracting text from PDF (iOS)

Wrap up

• Understanding PDF Structure

• Too many encodings, and test data is hard to find

• Too complex, and the documentation is not always clear

• Yeah, parsing PDF is hard, really…

Page 46: Extracting text from PDF (iOS)

Thank You

Kaz Yoshikawa