
Two-dimensional Context-Free Grammars:

Mathematical Formulae Recognition

Daniel Průša, Václav Hlaváč

Center for Machine Perception

Faculty of Electrical Engineering

Czech Technical University, Prague


Presentation Overview

• Formulae recognition, problem formulation
• Known methods
• General idea of structural recognition
• Two-dimensional context-free grammars
• Extension of the grammars
• Recognition tool, pilot implementation
• Results, future plans


Motivation for this work

To test a theoretical construct on a practical pilot problem with explicit structure: mathematical formulae.

The group of Schlesinger and Savchynskyy from Kiev works on music score recognition. We cooperate in a joint research project.


Math. formulae, off-line or on-line

Formulae recognition can be divided into two groups by the type of input:

1. Off-line recognition – a formula is depicted in a raster image.

2. On-line recognition – a formula is represented by a sequence of pen strokes (growing importance due to tablet PCs).


Math. formulae recognition, usage

Off-line recognition – conversion of scanned printed mathematical texts into an electronic form.

On-line recognition – connected to pen-based computing technologies (electronic tablets).

There are many papers on formulae recognition, but only a few commercial products (e.g., xMathJournal by xThink)


Usual architecture

Two independent layers:

1. Symbol detection and recognition.

2. Structural analysis.

[Diagram: image or sequence of strokes → symbol recognition → symbols (+ coordinates and font size) → structural analysis → derivation tree, with optional error corrections.]


Symbol recognition methods

1. Image segmentation + OCR tool.

2. Image segmentation and character recognition performed simultaneously (e.g., by Hidden Markov Models).

• It is very difficult to recover from errors made in the segmentation phase.

• Semantics are not taken into account.


Structural analysis methods

Grammar based:
• geometric grammars
• graph grammars

Non-grammar based:
• minimum spanning tree
• hard-coded rules


Our approach to structural recognition

Based on general structural constructions by M.I. Schlesinger, V. Hlaváč in Ten Lectures on Statistical and Syntactic Pattern Recognition (Kluwer Academic Publishers, 2002)

Do not separate segmentation and parsing; perform them simultaneously.
• Suitable for recognition of objects with rich structure.
• Already successfully applied to music scores and electric circuit diagrams.


Structural Recognition – General Idea

Assumptions: input image, set of derivation rules.

Recognition:

1. The algorithm starts with regions labeled by terminals:
   - squares corresponding to one symbol,
   - regions detected by an external tool.

2. Bigger regions labeled by non-terminals are derived by applying the rules; each derivation is assigned a penalty.

3. Result: the region matching the whole picture with the smallest penalty.

[Figure: region N is derived by a rule from regions A, B, C, D.]
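The three recognition steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule format `(N, A, B, direction, rule_penalty)`, the labels `a`, `Row`, `S`, the rectangle encoding and all penalty values are hypothetical, and only two-region rules are handled.

```python
from itertools import product

# Hypothetical rules: (N, A, B, direction, rule_penalty).
# "h": A lies directly left of B; "v": A lies directly above B.
RULES = [
    ("Row", "a", "a", "h", 0.0),
    ("S", "Row", "Row", "v", 1.0),
]

def join(first, second, direction):
    """Union of two touching rectangles (top, left, bottom, right), or None."""
    t1, l1, b1, r1 = first
    t2, l2, b2, r2 = second
    if direction == "h" and (t1, b1) == (t2, b2) and r1 == l2:
        return (t1, l1, b1, r2)
    if direction == "v" and (l1, r1) == (l2, r2) and b1 == t2:
        return (t1, l1, b2, r1)
    return None

def derive(terminals, rules):
    """Derive bigger regions until no penalty can be improved.

    terminals maps (label, rectangle) to a penalty; the result has the
    same shape and keeps the smallest penalty found for each entry.
    """
    best = dict(terminals)
    changed = True
    while changed:
        changed = False
        items = list(best.items())
        for n, a, b, direction, rule_penalty in rules:
            for ((la, ra), pa), ((lb, rb), pb) in product(items, items):
                if la != a or lb != b:
                    continue
                rect = join(ra, rb, direction)
                if rect is None:
                    continue
                penalty = pa + pb + rule_penalty
                if penalty < best.get((n, rect), float("inf")):
                    best[(n, rect)] = penalty
                    changed = True
    return best

# Step 1: terminal regions of a 2 x 2 picture, with made-up penalties.
terminals = {
    ("a", (0, 0, 1, 1)): 1.0,
    ("a", (0, 1, 1, 2)): 2.0,
    ("a", (1, 0, 2, 1)): 3.0,
    ("a", (1, 1, 2, 2)): 4.0,
}
best = derive(terminals, RULES)          # step 2
whole_picture = best[("S", (0, 0, 2, 2))]  # step 3
```

The loop stops because every update strictly lowers a penalty and each derived region is larger than the regions it was built from.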


Structural Recognition Applied to Formulae using 2D Context-free Grammars

• Terminals detection: detect all possible occurrences of elementary symbols using an OCR tool and assign each occurrence a penalty (computed by the OCR tool).

• Uniform shapes of regions are considered: rectangles.

• A 2D grammar for mathematical formulae was designed.

[Figure: detected terminal candidates labeled "fraction line, minus sign" and "symbol 5".]


Structural Recognition Applied to Formulae using 2D Context-free Grammars

Parsing: let the structural analysis decide the best segmentation and interpretation of the elementary symbols, i.e., find the derivation tree that covers the whole image and has the smallest penalty.

[Figure: example image with the symbols "5", "2" and "-".]


Two-dimensional Context-free Grammars

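$G = (V_N, V_T, S, P)$

$V_T$ … set of terminals
$V_N$ … set of non-terminals
$S$ … initial non-terminal
$P$ … set of productions

Three basic types of productions in P:

1) $N \to A$
2) $N \to A \mid B$
3) $N \to \begin{matrix} A \\ B \end{matrix}$

Generalized form of productions:

$$N \to \begin{matrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{matrix}$$

where $N \in V_N$ and $A, B, A_{i,j} \in V_N \cup V_T$.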


Interpretation of Productions

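1) $N \to A$
2) $N \to A \mid B$
3) $N \to \begin{matrix} A \\ B \end{matrix}$

[Figure: 1) region A relabeled as N; 2) regions A and B side by side forming region N; 3) region A above region B forming region N.]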

G generates pictures that can be named by the initial non-terminal S
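The second and third production types amount to horizontal and vertical concatenation of pictures. A minimal Python sketch, assuming a picture is stored as a list of equal-length strings (one string per row); the helper names `hcat` and `vcat` are hypothetical:

```python
def hcat(a, b):
    """Type 2 production: place picture a to the left of picture b."""
    if len(a) != len(b):
        raise ValueError("pictures must have the same number of rows")
    return [row_a + row_b for row_a, row_b in zip(a, b)]

def vcat(a, b):
    """Type 3 production: place picture a above picture b."""
    if len(a[0]) != len(b[0]):
        raise ValueError("pictures must have the same number of columns")
    return a + b

# A 1 x 2 row built side by side, then stacked into a 2 x 2 picture.
row = hcat(["a"], ["b"])
picture = vcat(row, row)
```

Concatenation is only defined when the touching sides have equal length, which is why a derived region is always a rectangle.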


Theoretical Results on 2D CF Languages

L(2CFG) ... class of languages that can be generated by a 2D CF grammar

• There is no analogy to the Chomsky normal form of productions

• Emptiness problem is not decidable

• L(2CFG) and L(2FSA) are not comparable

• The basic form of productions is weaker than the general one

• Languages in L(2CFG) can be recognized in polynomial time

• L(2CFG) includes 1D context-free languages

Observation: a natural generalization, but the properties of L(2CFG) differ from those of the class of 1D context-free languages.


Recognition in Polynomial Time

2D CF grammars with productions in the basic form:

Generated languages can be recognized in time $O(m^2 n^2 (m + n))$ (M.I. Schlesinger), where $m \times n$ is the picture size.

The algorithm can be generalized to all languages in L(2CFG). For $G = (V_N, V_T, S, P)$ the recognition time is $O(m^{p+1} n^{q+1})$, where

$p = \max\{\, s \mid N \to [A_{ij}]_{s,t} \in P \,\}$ is the maximal number of rows on the right-hand side of a production,

$q = \max\{\, t \mid N \to [A_{ij}]_{s,t} \in P \,\}$ is the maximal number of columns on the right-hand side of a production.

• The degree of the polynomial depends on the size of the productions.
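One way to see where a bound of this shape can come from for the basic form, assuming a dynamic-programming recognizer that fills a table over all sub-rectangles of the picture (a CYK-style sketch; the slides do not spell out the algorithm itself):

$$
\#\text{sub-rectangles} = \binom{m+1}{2}\binom{n+1}{2} = \frac{m(m+1)}{2}\cdot\frac{n(n+1)}{2} = O(m^2 n^2)
$$

$$
\#\text{cuts of an } h \times w \text{ sub-rectangle} = (h-1) + (w-1) \le m + n
$$

$$
\text{total work} = O(m^2 n^2)\cdot O(m+n) = O\big(m^2 n^2 (m+n)\big)
$$

Here the grammar is treated as fixed, so trying the productions at each cut costs a constant.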


Extension of 2D CF Grammars

2D context-free grammars are not powerful enough to express the complex structure of mathematical formulae.

[Figure: example formulae; recoverable text: "53", "351", "+462".]

We need a formalism that makes it easy to work with relative positions and sizes of symbols, e.g., to express relationships such as “a symbol is a superscript of another symbol”.


Extension of 2D CF Grammars

Each derived region is assigned a feature point (logical center). The feature point of a derived region is determined by the applied production.

[Figure: example "351".]

Regions are still rectangles.


Extension of 2D CF Grammars

The use of productions is not limited to directly neighboring (touching) rectangles.

Productions can specify a rectangular area in which some specific point of a rectangle has to be contained.

Positions and sizes can be given relative to one of the rectangles.

Restrictions on the relative sizes of rectangles are also possible.

[Figure: example "53+2" with rectangles A, B and C.]
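A relative-position restriction of this kind could be checked as in the sketch below. It is an illustration only: the `Region` and `Constraint` types, the choice of the base region's height as the unit, and all numeric bounds of the `SUPERSCRIPT` constraint are assumptions, not values from the tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    left: float
    top: float
    right: float
    bottom: float
    fx: float  # feature point (logical center), x coordinate
    fy: float  # feature point, y coordinate (y grows downwards)

    @property
    def height(self):
        return self.bottom - self.top

@dataclass(frozen=True)
class Constraint:
    # Allowed area for the other region's feature point, given as offsets
    # from the base feature point in multiples of the base height.
    dx_min: float
    dx_max: float
    dy_min: float
    dy_max: float
    # Allowed ratio of the other region's height to the base height.
    scale_min: float
    scale_max: float

def satisfies(base, other, c):
    """True if `other` lies in the allowed area and has an allowed size."""
    unit = base.height
    dx = (other.fx - base.fx) / unit
    dy = (other.fy - base.fy) / unit
    scale = other.height / unit
    return (c.dx_min <= dx <= c.dx_max
            and c.dy_min <= dy <= c.dy_max
            and c.scale_min <= scale <= c.scale_max)

# Hypothetical bounds: to the right, clearly above, and noticeably smaller.
SUPERSCRIPT = Constraint(0.2, 1.5, -1.0, -0.2, 0.3, 0.8)

five = Region(0, 0, 10, 20, fx=5, fy=10)
three = Region(11, -4, 17, 6, fx=14, fy=1)   # small and raised
two = Region(12, 0, 22, 20, fx=17, fy=10)    # same size, same line

is_superscript = satisfies(five, three, SUPERSCRIPT)
is_not_superscript = not satisfies(five, two, SUPERSCRIPT)
```

Expressing the area relative to the base region keeps the same constraint usable for formulae written at different sizes.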


Penalty Computation

Based on summing partial penalties determined by the following criteria:

• Used production.
• Relative sizes and positions of the regions the production is applied to (original regions).
• Number of black pixels in the new region that are not in the original regions.
• Penalty of the original regions.
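The summation could look like the following sketch. The function names, the weight `pixel_weight` and every numeric value are assumptions made for illustration; the slides do not say how the partial penalties are weighted.

```python
def uncovered_black(image, new_rect, originals):
    """Count black pixels (value 1) inside new_rect that lie in no original.

    Rectangles are (top, left, bottom, right), bottom and right exclusive.
    """
    top, left, bottom, right = new_rect
    count = 0
    for y in range(top, bottom):
        for x in range(left, right):
            covered = any(t <= y < b and l <= x < r
                          for (t, l, b, r) in originals)
            if image[y][x] == 1 and not covered:
                count += 1
    return count

def total_penalty(production_penalty, layout_penalty, uncovered,
                  original_penalties, pixel_weight=2.0):
    """Sum of the four partial penalties listed above."""
    return (production_penalty + layout_penalty
            + pixel_weight * uncovered + sum(original_penalties))

image = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
]
originals = [(0, 0, 2, 1), (0, 3, 2, 4)]   # left and right columns
stray = uncovered_black(image, (0, 0, 2, 4), originals)
penalty = total_penalty(0.5, 0.25, stray, [1.0, 1.5])
```

Penalizing uncovered black pixels discourages derivations that silently skip part of the drawing.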


Implementation of the Recognition Tool

• Off-line recognition.
• Implemented in Java.
• Trained and tuned for hand-written formulae.
• Black and white images (but can be extended to gray-scale images).
• The following constructs are supported:
   - variables, numbers, parentheses,
   - common unary and binary operators, power-to operator,
   - fractions, square root, subscripts, superscripts,
   - sum, integral.
• Can deal with noise, ambiguities, touching or split symbols, etc., and also with misplaced symbols.


Tool Architecture

[Diagram: terminals detection (uses the OCR tool) → parsing (uses the 2D grammar).]


Terminals Detection

OCR tool used: a simple method was implemented. A feature vector is extracted from the image and classified by a k-nearest neighbor classifier, trained for all supported elementary symbols.

Ideally, all regions should be scanned for the presence of an elementary symbol, but this is too time-consuming, so two smarter strategies were implemented:

• Scanning rectangular windows of some predefined sizes (not all sizes).

• Detection based on connected components.

Limitations of the method: overlapping bounding boxes of symbols, symbols that intersect.
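Detection based on connected components can be sketched as a flood fill that returns one bounding box per component. This is a minimal illustration, assuming a binary image stored as a list of rows with 1 for black pixels and 8-connectivity; it is not the tool's actual code.

```python
from collections import deque

def component_boxes(image):
    """Bounding boxes (top, left, bottom, right) of 8-connected components.

    bottom and right are exclusive; boxes are listed in row-major order
    of the first pixel found in each component.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom + 1, right + 1))
    return boxes

image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
boxes = component_boxes(image)
```

Each box would then be passed to the OCR tool as a terminal candidate. Touching symbols end up in one component and split symbols in several.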


Remarks on Terminals Detection

• Symbols that do not have size limited by a constant are not treated as terminal symbols (e.g., fraction line, square root).

• In addition, square root cannot be separated from an image by a rectangle (it surrounds its argument).

Solution: treat these cases as symbols composed of several terminal symbols and extend the grammar with the related productions.

[Figure: example formula; the extracted text "22 ba" is probably $\sqrt{a^2 + b^2}$.]


Parsing Algorithm

Bottom-up approach, as described for general structural recognition.

Complexity: depends on the number of terminals detected during the first phase. In general it can be exponential, but it is substantially reduced by restrictions on productions and by the use of suitable data structures.

Data structures for orthogonal range queries (searching points that are located in a rectangle) used to speed up the algorithm.
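As an illustration of an orthogonal range query, the sketch below keeps points sorted by x and uses binary search to narrow the x-range before filtering on y. This is a deliberately simple stand-in (the class name `PointIndex` is hypothetical); the slides do not say which structure the tool uses, and a k-d tree or range tree would answer such queries faster on large inputs.

```python
from bisect import bisect_left, bisect_right

class PointIndex:
    """Static set of 2D points supporting axis-aligned rectangle queries."""

    def __init__(self, points):
        self._points = sorted(points)            # by x, then by y
        self._xs = [p[0] for p in self._points]

    def query(self, x_min, x_max, y_min, y_max):
        """Points with x_min <= x <= x_max and y_min <= y <= y_max."""
        lo = bisect_left(self._xs, x_min)
        hi = bisect_right(self._xs, x_max)
        return [p for p in self._points[lo:hi] if y_min <= p[1] <= y_max]

# Feature points of already derived regions (made-up coordinates).
index = PointIndex([(1, 1), (2, 5), (3, 2), (6, 3), (4, 9)])
hits = index.query(2, 5, 0, 5)
```

In the parser, such a query would return the regions whose feature point falls into the area a production allows, so only those candidates need to be tried.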


Future Plans

• Focus on printed formulae.
• Collect a sufficiently large set of annotated printed formulae.
• Apply learning methods: learn etalons of elementary symbols and production parameters.