
[IEEE 2009 Third International Conference on Research Challenges in Information Science (RCIS) - Fez, Morocco (2009.04.22-2009.04.24)] 2009 Third International Conference on Research


Representation, handling and recognition of mathematical objects: state of the art

Widad Jakjoud Department of Computer Sciences

Faculty of Sciences, Cadi Ayyad University Marrakech, Morocco

[email protected]

Abstract—Mathematical language was developed by mathematicians to facilitate the communication of mathematical knowledge. It would be valuable to use the same language for human-machine communication. This requires adequate representation, manipulation and recognition of the two-dimensional structure of mathematical notation between human and machine.

In this paper, we define, present and discuss the handling of mathematical objects. We give a short review of presentation standards and systems, of approaches to physical and logical segmentation, of systems and methods for detecting mathematical objects that are isolated or embedded in text, and of representation structures and methods for different recognition approaches. We focus on describing the structural recognition process.

Keywords-component: Mathematical objects; Segmentation; Mathematical object recognition; Symbol recognition; Symbol arrangement analysis; Mathematical notation; Syntactic method; Graph rewriting.

I. INTRODUCTION

Mathematical language is used by mathematicians to represent the mathematical objects they communicate. It allows the writer to omit any information that the reader can deduce.

Mathematical notation is only semi-standardized, and new notations can still appear.

A mathematical object is the kernel of a mathematical document; it can be a set of elementary mathematical objects arranged according to the grammar rules of mathematical language.

An elementary mathematical object is the smallest entity that carries meaning and cannot be divided into other mathematical objects. A mathematical expression is a mathematical object, or a set of objects possibly connected or arranged by operators.

A mathematical object can be:

• Words, which are mathematical abbreviations (e.g. sin, cos, …) or connective words (e.g. if, then, …);

• Operators (e.g. arithmetical operators, logical operators, …);

• Signs (e.g. integral, derivative, …);

• Alphabetical symbols (e.g. Greek, Latin, Arabic, Hebrew, …).

Mathematical objects can be arranged spatially in a two-dimensional structure governed by spatial rules (e.g. exponents, indices, ...).

II. MATHEMATICAL OBJECT REPRESENTATION

A. Introduction

The manipulation of mathematical documents poses several kinds of problems. For mathematical objects in particular, the problems are how to:

• notate mathematical objects;

• represent mathematical objects;

• archive and restore mathematical objects;

• present, or display, mathematical documents;

• manipulate expressions in a formula.

These problems arise in particular with regard to:

• the two-dimensional structure of mathematical objects;

• a multilingual environment;

• the lack of standardization of mathematical notation.

B. Mathematical objects publishing

Publishing mathematical objects poses problems, particularly for two-dimensional objects such as the integral symbol, the square root and the fraction. Although several studies have addressed the issue, the results remain limited. Various methods are in use:

• The first method is to type a sequence of commands following a linear syntax, that is, to type the mathematical object using keywords that designate operators. This method was in use before graphic terminals appeared (e.g. LaTeX, Maple).

9781-4244-2865-6/09/$25.00 ©2009 IEEE

• The template palette method, which uses a set of templates, chosen from a menu bar, that correspond to different formulas.

• Two-dimensional editing modes, which remain non-obvious.

• Handwritten input of the paper-and-pencil type. Two examples are cited in the literature:

The system of Richard H. Littin uses a data tablet or graphics tablet. Mathematical objects are recognized at acquisition time. The method is based on an LR grammar; the result is a linear prefix expression in the style of Lisp. The problem is that modification is limited to correcting the last symbol of the expression.

The Fuduka system imposes a pause after each symbol is entered, so that the symbol can be verified and, if necessary, corrected before moving on to the next one.

The advantage of handwritten input is that it is more natural and more intuitive. Its disadvantage is that the system must cope with different writing styles and with imperfections such as approximate symbol alignment, barely distinguishable symbol sizes, etc.

C. Mathematical object representation

To communicate mathematical objects to computers, one must define an internal representation of those objects. Various standards and systems have been designed to facilitate this communication. They offer a semantic and/or syntactic representation of the mathematical object. Below we present a non-exhaustive list of systems and standards for representing mathematical formulas:

a) TeX/LaTeX

LaTeX is a document preparation system that typesets mathematical formulas. Writing complex formulas in LaTeX without errors is difficult in practice. LaTeX is a set of macros that makes the TeX system easier to use. The underlying system is the work of D. E. Knuth [1], who developed two languages: METAFONT, for the mathematical description of fonts, and TeX, for arranging glyphs to form words, sentences and mathematical objects.

The basic idea of LaTeX is to make it possible to present a large variety of mathematical symbols using only the characters available on common computer keyboards. It gives formulas an elegant, automatic layout: spaces are added automatically around operators, and the size reduction and vertical shift of scripts are encoded in the formatting semantics of the scripting operator and adjusted automatically. LaTeX uses a linear syntax; the user therefore has to know the commands and keywords that designate operators.

TeX and LaTeX are pure languages for representing mathematical objects; they perform no processing related to the semantics of those objects.

They represent every mathematical object linearly and leave out all the information that a human can easily deduce in order to reconstruct the object, for instance the structure of expressions or their types.

Users of mathematical objects represented in LaTeX have to define adequate semantics according to context. Software for symbolic computation (computer algebra) and for checking cannot use the LaTeX representation of mathematical objects without additional semantic information to reconstruct the objects. Semantics can be extracted from a LaTeX document by converting its mathematical objects to MathML, for example.

LaTeX reduces the two-dimensional aspect of a mathematical formula to a linear representation; some semantic information is omitted and lost.

When mathematical knowledge is exchanged between mathematicians, this loss is insignificant, on the assumption that the mathematician who receives the expression can deduce the semantics. When the exchange is with software, however, explicit semantic information must complete the representation of the mathematical expression. Extracting semantics from the LaTeX representation of an object relies on a series of processes [3].
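As a small illustration of the linear syntax discussed above, the following LaTeX fragment shows how two-dimensional notation is flattened into a sequence of commands. The formulas are chosen only for illustration; the last line is an example of source whose meaning a human reader resolves from context but which the markup itself does not state.

```latex
% Linear source                 % Two-dimensional object it denotes
\frac{a+b}{c}                   % fraction: a+b over c
x_{i}^{2}                       % one base with a subscript and a superscript
\int_{0}^{1} x^{2}\,dx          % integral with lower and upper limits
f(a+b)                          % ambiguous: f applied to a+b, or f times (a+b)?
```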

b) MathML

MathML is an XML-based standard that defines no macros; it is in its third version.

The two parts of MathML, Presentation MathML and Content MathML, are independent but complementary, and each serves a specific purpose.

Presentation MathML is responsible for graphical display and printing, whereas Content MathML is responsible for semantic representation: it supports computing with and verifying the content of a formula. This part, however, is not completely developed and covers only a limited group of mathematical objects.

The syntax of MathML is inherited from XML, a very useful tool for presenting data and distributing it on the Web. To generate the MathML code of mathematical objects, one can use equation editors, conversion programs or other specialized software tools.

MathML must handle both presentation and content; the notion of semantics is integrated into both parts of MathML, Presentation MathML and Content MathML [1].

Some semantics are present in Presentation MathML:

• The structure of a mathematical object is already semantic information for MathML. Operator precedence can be deduced from the structure of the object alone. Indeed, one can deduce that an operator contained in an mrow element has higher priority than another operator declared before the mrow, even before knowing the nature of those operators.

• The encoding used in MathML reflects semantics: Unicode classifies characters according to their semantics, not their shapes. For example, the differential symbol d is encoded as a differential entity, not as a symbol that resembles a letter.

It is important to note the invisible operators, such as the multiplication sign, “apply function” (separating a function from its argument, as in cos x or cos(x)) and the “invisible comma” separating multiple indices. All invisible operators are indicated explicitly with Unicode characters in MathML, which leaves no ambiguity.

As for the semantics of Content MathML, it is enough to mention that Content MathML aims to encode the meaning of the mathematical object.
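The difference between the two parts can be made concrete with a minimal Python sketch. It encodes the product 2x once in Presentation MathML, where the multiplication is the invisible operator U+2062 (INVISIBLE TIMES), and once in Content MathML, where the operator is named. The helper names `operators` and `content_head` are hypothetical, and the MathML namespace declaration is omitted for brevity, which a real document would include.

```python
import xml.etree.ElementTree as ET

# Presentation MathML for "2x": the product is invisible on paper but is
# written explicitly as U+2062 (INVISIBLE TIMES). Namespace omitted (assumption).
presentation = (
    "<math><mrow>"
    "<mn>2</mn><mo>\u2062</mo><mi>x</mi>"
    "</mrow></math>"
)

# Content MathML for the same object: the operator is named, not drawn.
content = "<math><apply><times/><cn>2</cn><ci>x</ci></apply></math>"


def operators(markup):
    """Return the text of every <mo> element in a Presentation MathML string."""
    return [mo.text for mo in ET.fromstring(markup).iter("mo")]


def content_head(markup):
    """Return the tag of the operator applied in a Content MathML string."""
    apply_element = ET.fromstring(markup).find("apply")
    return apply_element[0].tag


print([hex(ord(op)) for op in operators(presentation)])  # ['0x2062']
print(content_head(content))                             # times
```

The presentation form tells a renderer what to draw, while the content form lets a program find the operator directly without interpreting layout.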

c) OpenMath

OpenMath is a powerful standard, more general than Content MathML. It focuses on the meaning of the objects it builds. It uses an extremely simple software kernel based on a markup system to recognize primitive syntactic forms, together with an extension mechanism for mathematical concepts: the “content dictionaries”.

OpenMath objects are essentially represented as trees, in which the nodes are symbols or variables. Mathematical objects whose definition contains the same data structure several times (e.g. x * x * x * x * x) need a more space-efficient representation. OpenMath represents those objects as a directed acyclic graph (DAG), which can be expanded into a tree by recursively copying all sub-graphs that have more than one incoming edge. DAGs save space through structure sharing [2].

Figure 1. Tree representation of a mathematical object

Figure 2. Directed acyclic graph representation of a mathematical object
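The saving obtained by structure sharing can be sketched in a few lines of Python. The sketch below is not OpenMath syntax: it models expressions as nested tuples, and the binary grouping of x * x * x * x * x as ((x*x)*(x*x))*x is an assumption made so that repeated sub-expressions appear. It counts the nodes of the fully expanded tree and the distinct nodes that remain when equal sub-expressions are shared.

```python
def count_tree_nodes(expr):
    """Count nodes when every sub-expression is copied (tree form)."""
    if isinstance(expr, str):          # a variable is a leaf
        return 1
    _op, *args = expr
    return 1 + sum(count_tree_nodes(arg) for arg in args)


def count_dag_nodes(expr, seen=None):
    """Count distinct sub-expressions, i.e. nodes when equal subtrees are shared."""
    if seen is None:
        seen = set()
    if expr in seen:                   # already stored once: share it
        return 0
    seen.add(expr)
    if isinstance(expr, str):
        return 1
    _op, *args = expr
    return 1 + sum(count_dag_nodes(arg, seen) for arg in args)


# x^5 grouped as ((x*x)*(x*x))*x (hypothetical grouping, for illustration only)
x2 = ("times", "x", "x")
x4 = ("times", x2, x2)
x5 = ("times", x4, "x")

print(count_tree_nodes(x5), count_dag_nodes(x5))  # 9 4
```

Expanding the DAG back into a tree corresponds to ignoring the `seen` set, which is the recursive copying described above.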

d) OMDoc

OMDoc stems from the ambition to define a markup language, based on OpenMath and Content MathML, for representing mathematical knowledge and exchanging mathematical objects between different mathematical software systems (computer algebra, automated theorem proving, formula validation systems, etc.).

It integrates OpenMath and Content MathML, and it inherits the syntax of the well-known XML standard.

One of the intentions behind the design of OMDoc is the separation of content and presentation: it deals with content, not presentation.

Because OMDoc inherits the syntax of XML, it can use an XSLT transformation processor to generate the desired output format. XSLT is responsible for converting to other XML-based formats.

D. Mathematical objects and computer algebra

A computer algebra system is software that allows the manipulation of mathematical objects. The name reflects the fact that computations are carried out on formulas by manipulating symbols algebraically.

Computer algebra systems are limited by the methods implemented at design time. They cannot, for example, recognize a new theorem. The mathematical knowledge used in computer algebra is available almost exclusively on paper (mathematics books, physics books, …). Although several knowledge-base applications gather this knowledge (databases of lemmas, electronic collections of mathematical formulas, …), the only way to feed those bases is still manual entry.

E. Documents and Mathematical objects storage

Mathematical knowledge is vast. Authors have to cite many references, old and recent, which is practically impossible to do thoroughly: one cannot study all the existing mathematical literature even in a single discipline, let alone in the whole of mathematics. The development of distributed database systems for storing mathematical documents addresses this problem by automatically establishing links between paper references and the papers published in mathematical journals.

Several works have dealt with warehouses of mathematical knowledge. Among them we note:

For the storage of mathematical objects, the DLMF (Digital Library of Mathematical Functions) project, led by the National Institute of Standards and Technology, aims to develop a successor to the handbook of Abramowitz, which should become an important resource.

The project aims to publish a handbook and an accompanying Web site. Besides covering mathematical formulas, graphs, references, computing methods and links to software, it seeks to include 3D graphical representations and equation search tools.

A warehouse of scientific documents already exists: JSTOR. It stores reliable scientific journals and gives access to them wherever the user is located, allowing the user to print or download papers. JSTOR is not a warehouse of current publications: there is generally a time lag of 1 to 5 years between the latest published issue and the most recent issue available in JSTOR.

F. Multilanguage displaying of mathematical documents

Mathematical language is a human language for representing mathematical objects. It is meant to be universal, so as to allow mathematical knowledge to be communicated between mathematicians who may belong to different cultures.

Mathematical documents and the mathematical objects they contain must sometimes be displayed in two different writing directions. This is the case of documents written in the Moroccan way, where the mathematical objects are written from left to right while the rest of the text is written from right to left.

Since mathematical objects are two-dimensional, handling them in this context is more complex than handling plain text.

III. THE MATHEMATICAL OBJECT RECOGNITION

A. The recognition methods

The recognition of mathematical objects may be considered a special case of pattern recognition.

Several groups of methods exist: Markovian, neural and structural methods.

Markovian methods: used for elements whose characteristics are observed over time or space. The decision is made using probability theory.

Neural methods: based on the perceptron or on multilayer networks.

Structural methods [5, 20]: the most appropriate for documents whose elements are represented in a structured format; they are the most widely used for mathematical documents. Indeed, mathematical expressions belong to the modeled formats that have a stable, planar structure.

B. Problem

The recognition of mathematical objects is subdivided into two parts: recognizing the individual symbols that constitute the object, and recognizing the object from its component symbols. The recognition process consists essentially of four phases: early processing, segmentation, symbol recognition and structural recognition.

Early processing removes the noise introduced by acquisition (scan quality, sampling frequency of the graphics tablet, ...). For handwritten input, the data must be normalized and made independent of how the writing was produced (writing speed, shape, ...).

The symbol recognition phase is divided into two parts: segmentation and recognition.

The phase that recognizes the structure of the expression is divided into three parts: identifying the spatial relationships between symbols, identifying the logical relationships, and finally constructing the meaning by introducing lexical, grammatical and semantic knowledge.

Recognition systems may merge some steps or change their order of execution. Some systems for recognizing mathematical expressions do not include a symbol recognition step.

C. The recognition techniques

Recognition techniques must overcome several problems, such as:

• The spatial layout of operators, which defines the position of the operands.

• Operator precedence, which determines how the formula is evaluated.

• Identifying the nature of a symbol (operand or operator).

• Small symbols such as . , ; : ~, which are a source of ambiguity during noise reduction after the digitization phase for typeset documents.

• Symbol segmentation, which is not always easy, especially for handwritten input.

• Symbols that may be ambiguous (e.g. the symbol “.” may be the multiplication operator, the decimal point or even the derivation symbol). This problem must be handled in the symbol arrangement analysis phase.

• Symbol recognition and the identification of spatial relations, which are very difficult to determine clearly and reliably because of the relative position, size and style of the symbols.

• The identification of logical relations, which relies on the spatial relations between symbols; examples are given in [5]. Indeed, the arrangement of symbols relative to one another determines the meaning of the mathematical object.

D. Early processing

Data acquisition is, naturally, the first step of the recognition process. After a document is scanned, several operations determine the quality of the result:

a) Scanning threshold.

b) Noise reduction, to filter out scanning imperfections. This step is decisive and raises a serious problem, because several mathematical symbols are very small and can be confused with noise. The method of P. Chou [6], based on a probabilistic grammar that takes noise into account, helps overcome this problem.

c) Text realignment, to correct the text when the paper document was badly positioned during scanning. Two methods are used:

• A method based on successive rotations: it performs a succession of horizontal and/or vertical projections, varying the projection angle slightly, to compute histograms of the black pixels that compose the image, and searches for the angle that maximizes the peaks and troughs of the histogram.

• Another method, based on the histogram of the slopes of the straight lines that join the centers of neighboring symbols.
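The successive-rotation method can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: black pixels are given as (x, y) coordinates, each candidate angle is scored by the sum of squared row counts of the projection (one common way to reward sharp peaks and troughs, chosen here as an assumption), and the synthetic page, the angle range and the step are all hypothetical.

```python
import math


def profile_score(points, angle_deg):
    """Sum of squared row counts after rotating the black pixels by angle_deg."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    rows = {}
    for x, y in points:
        row = round(x * sin_a + y * cos_a)   # y-coordinate after rotation
        rows[row] = rows.get(row, 0) + 1
    return sum(n * n for n in rows.values())


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Try small rotations and keep the one with the sharpest projection profile."""
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda angle: profile_score(points, angle))


def synthetic_page(skew_deg, lines=4, width=200, spacing=20):
    """Black pixels of a few horizontal text lines, rotated by skew_deg."""
    a = math.radians(skew_deg)
    points = []
    for k in range(lines):
        y = k * spacing
        for x in range(width):
            points.append((x * math.cos(a) - y * math.sin(a),
                           x * math.sin(a) + y * math.cos(a)))
    return points


page = synthetic_page(skew_deg=2.0)
correction = estimate_skew(page)
print(correction)  # rotation to apply to undo the skew: -2.0
```

A finer step gives a more precise angle at the cost of more projections, which is why the angle is varied only slightly around zero.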

d) Separation of mathematical objects from the document: this is a very delicate operation, and most recognition systems assume that this step has already been performed.

Mathematical objects in documents can be isolated formulas (IF) or embedded formulas (EF), that is, formulas included in the text. Isolated formulas are simpler to isolate and extract than embedded ones.

Different approaches will be cited hereafter:

e) Construction of symbol bounding boxes:

Three methods can be used:

• Bounding boxes as a rectangular region, as used in [7].

• Bounding boxes as an individual rectangular region, which is similar to the previous method but more complicated.

• Bounding boxes as a convex region. This is the best method, notably in terms of complexity.

The typical result is the set of symbols together with indices relating them to their bounding boxes.
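The first variant, rectangular bounding boxes, can be sketched with a connected-component search. This is an illustrative sketch only, not the method of [7]: the image is a list of strings in which '#' marks a black pixel, 8-connectivity is assumed, and the function name and box layout (top, left, bottom, right) are hypothetical choices.

```python
from collections import deque


def bounding_boxes(image):
    """Return (top, left, bottom, right) boxes of 8-connected black regions.

    `image` is a list of equal-length strings where '#' marks a black pixel.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != "#" or seen[r][c]:
                continue
            # Breadth-first search over one connected component.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


image = [
    "##...#.",
    "##...#.",
    ".....#.",
    "..##...",
]
print(bounding_boxes(image))
# [(0, 0, 1, 1), (0, 5, 2, 5), (3, 2, 3, 3)]
```

Symbols made of several disconnected strokes (such as "i" or "=") yield several boxes with this approach and need a later grouping step.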

E. Segmentation:

Segmentation means identifying individually each symbol that composes the mathematical object.

The technique used is the projection of vertical lines:

a) Physical segmentation methods are very varied:

Bottom-up methods: designed for flat textual documents, so they are less suitable for mathematical documents, given the spatial nature of mathematical objects.

These methods can be adapted to mathematical documents by adding constraints for the nonlinearity of symbols and/or by modifying the constraints that impose linearity on the segmented symbols.

Top-down methods: one family of these methods is based on multiresolution and uses a blurred view of the set of symbols that constitute the words of the text. For mathematical objects, these methods are more reliable and faster [8]. They operate on grey levels, which is not a problem: the image of a mathematical document is black and white, so it can be treated as a grey-level image with just two shades (white and black).

The segmentation process is divided into several steps [8]:

• estimating the characteristics of the document being processed (size, width and height of characters);

• breaking the document up into a set of textual blocks (at very low resolution);

• processing each isolated block to divide it into a set of horizontal lines;

• extracting the characters from each line.
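The last two steps, splitting a block into lines and each line into characters, can be sketched with projection profiles. The code below is a simplified illustration and not the multiresolution method of [8]: it works on a binary image given as a list of strings ('#' for black), and the names `runs` and `segment` are hypothetical.

```python
def runs(profile):
    """Return (start, end) index pairs of consecutive non-zero profile values."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans


def segment(image):
    """Split a binary image into lines, then each line into symbols.

    Returns a list of lines; each line is a list of (top, left, bottom, right).
    """
    row_profile = [row.count("#") for row in image]      # horizontal projection
    result = []
    for top, bottom in runs(row_profile):
        band = image[top:bottom + 1]
        col_profile = [sum(row[c] == "#" for row in band)  # vertical projection
                       for c in range(len(image[0]))]
        result.append([(top, left, bottom, right)
                       for left, right in runs(col_profile)])
    return result


page = [
    "#.##..",
    "#.##..",
    "......",
    ".###.#",
]
print(segment(page))
# [[(0, 0, 1, 0), (0, 2, 1, 3)], [(3, 1, 3, 3), (3, 5, 3, 5)]]
```

This purely linear cut works for plain text lines; superscripts, fractions and other two-dimensional arrangements overlap in the projections, which is the limitation of linear methods on mathematical documents noted above.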

b) Logical segmentation:

Whereas physical segmentation divides the document into a set of blocks, each block into lines and each line into symbols, logical segmentation aims to locate mathematical objects in two steps: first detect large mathematical objects (large expressions), then locate small formulas and elementary mathematical objects.

Detection of large mathematical objects: several clues facilitate the detection of large mathematical objects:

• they are isolated from the text;

• they are roughly centered on the page;

• they are generally tall, with tall lines;

• they are narrower than the page width.

These clues are very useful for detecting formulas, but they are not enough to build an exhaustive detection process. Other criteria can be used, such as:

• Spacing between characters: a short space is taken to be an inter-symbol space within a word; otherwise it is a space between words. The inter-character space depends on the nature of the document (resolution, printing parameters, etc.). Some techniques compute the short spaces between symbols of the same word, for example the histogram of inter-symbol gaps on a line [8].

• The average number of characters per word in a line indicates whether the line may contain a mathematical expression. Indeed, mathematical words are abbreviations (sin, cos, ...), and most of them consist of 3 to 4 symbols.

• Their density is lower than that of the rest of the text. This indicator is not reliable, however: a short paragraph of text, for example, can have exactly the same features.

• Isolated expressions mostly appear in separate regions of the document and consist largely of special symbols (italic, roman symbols, roman words, etc.).
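The word-length criterion in the list above can be sketched as a simple test. The sketch works on already transcribed text for readability, whereas in a recognition system the word lengths would come from counting symbol boxes between wide gaps; the threshold `max_mean` is an assumed value, not one given in the cited work.

```python
def mean_word_length(line):
    """Average number of characters per whitespace-separated word."""
    words = line.split()
    return sum(len(word) for word in words) / len(words) if words else 0.0


def looks_mathematical(line, max_mean=3.0):
    """Flag a line whose words are short on average (threshold is an assumption)."""
    return 0 < mean_word_length(line) <= max_mean


print(looks_mathematical("y = sin x + cos z"))                          # True
print(looks_mathematical("The segmentation process is divided into many steps"))  # False
```

As with the density criterion, this is only a clue: a line of short ordinary words would also pass, so it should be combined with the other criteria.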

Detection of small and elementary mathematical objects:

The recognition of the individual symbols of an expression must be done in a step after segmentation [11], but marking as many symbols as possible can be useful.

Symbol detection techniques vary according to the type of symbols to be recognized:

• Typeset symbols: given the complexity and variety of the shapes to recognize, the current tendency is no longer to use a single method. The use of mixed or hybrid methods is becoming general:

models mixing syntactic analysis and relaxation [7];

models combining syntactic and statistical shape descriptions in order to improve the representation of shapes with errors;

models mixing syntactic approaches and approaches based on neural networks applied to recognition.

• Handwritten symbols: techniques are based on cursive handwriting recognition methods:

Syntactic methods;

Recognition methods for particular classes of symbols, such as numbers;

Methods based on stochastic models:

o Approaches based on the Hidden Markov Model (HMM);

o Approaches based on Bayesian methods.

• Among the methods used to detect small expressions is the redundancy exploitation method, which classifies characters in a binary tree according to their graphic features (the ratio of surface to bounding box, height, total and partial projections, …). This method relies on a low-level expert system; the absence of a high-level segmentation technique is a major disadvantage.

Other methods are used to label mathematical expressions:

• The method of Lee and Wang [15] labels lines as mathematical expressions according to the presence of spaces before and after the expression. The method is based on heuristics, and errors persist, such as titles being confused with mathematical expressions.

• Methods that aim to detect the zones where mathematical objects are likely to be present. The next step is to find the correspondence between the elementary mathematical objects present by arranging their bounding boxes. The process tries to identify the detected elementary mathematical objects and to determine the links between them at low resolution. This process is very expensive.

• Fatman system [17]: it does not directly use information that makes isolated symbols easy to recognize, such as symbol size. It proposes a three-step process to separate mathematical text from ordinary text automatically. The process essentially uses two bags: one for mathematical objects and the other for the words of the ordinary text.

The first step initializes the separation process:

• The following groups of components initialize the mathematical bag: special mathematical symbols (operators, Greek symbols, scientific symbols, horizontal lines), italic and/or bold words, roman figures, braces, brackets, mathematical abbreviations (sin, cos, tan), punctuation marks, …

• The other words initialize the textual bag.

The second step corrects the contents of the mathematical bag by removing elements:

• Left brackets with no mathematical terms to their right.

• Right brackets with no element to their left identified as belonging to the mathematical bag.

• Horizontal lines with no elements of the mathematical bag above, below or to the right.

• Commas or decimal points isolated in the mathematical bag.

The process then groups mathematical symbols by proximity. If two bounding boxes are sufficiently close, horizontally or vertically, they are merged and a new bounding box is created.
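The grouping by proximity can be sketched as an iterative merge of bounding boxes. This is an illustration under stated assumptions, not the rule used in [17]: boxes are (top, left, bottom, right) tuples, two boxes are considered close when both their horizontal and vertical gaps are at most `max_gap` pixels, and the threshold value is hypothetical.

```python
def gap(a, b):
    """Horizontal and vertical gaps between boxes (top, left, bottom, right)."""
    dx = max(a[1] - b[3], b[1] - a[3], 0)
    dy = max(a[0] - b[2], b[0] - a[2], 0)
    return dx, dy


def merge_close_boxes(boxes, max_gap=2):
    """Repeatedly merge any two boxes whose gaps are both at most max_gap."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                dx, dy = gap(boxes[i], boxes[j])
                if dx <= max_gap and dy <= max_gap:
                    a, b = boxes[i], boxes[j]
                    # Replace the pair by the smallest box containing both.
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


symbols = [(0, 0, 10, 10), (0, 12, 10, 20), (0, 40, 10, 50)]
print(merge_close_boxes(symbols))
# [(0, 0, 10, 20), (0, 40, 10, 50)]
```

The loop restarts after each merge because a newly enlarged box may now be close to a box that was previously too far away.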

The third step corrects the contents of the textual bag: several heuristics decide whether a mathematical abbreviation belongs to the textual bag or to the mathematical one (what appears to the right of the word could be a clue).

This process has limits. In general:

• Italic words are mistaken for mathematical objects.

• The number 1 and the symbol I are confused.

Human intervention allows the process to be corrected:

• Identify new mathematical words that the process has not recognized.

• Identify italic words that are not mathematical objects.

The failures of the process could be corrected if one managed to build an intelligent agent able to understand and analyze both natural and mathematical language, so as to correct the elements assigned to either bag.

• The system EXTRAFOR [9] which base on fuzzy logical

presents very interesting results: Almost 100% isolated notations and more than 90% of symbols or included expressions in a paragraph are extracted correctly of the document. The method of working of the system can be summarized at:

Global segmentation phase: Many steps are followed for formula extraction. A set of connected components (CCXs) is extracted and a

set of CCXs’s parameters are calculated (CCXs bounding boxes coordinates, the ratio, the air and the density of each of these CCXs…) basing on these parameters the system may determine if the line contains text or isolated formulae.

Local segmentation phase: in order to extract some embedded formulae in text: a secondary labeling of CCXs has been presented. It is a fuzzy logic based learning.

Some errors still persist:

The confusion between indices, exponents and diacritical or punctuation signs.

The confusion between the minus sign and diacritical hyphens or delimiters.

The confusion between the number 1 and the letter I.

• Jianming Jin, Xionghu Han and Qingren Wang [11] designed a system that computes a number of parameters for each line, such as: the height (h) and length (l) of the line, the average height of all the characters of the line (h0), the distance between the current line and the line above (as), the distance between the current line and the line below (bs), the distance between the left-hand delimiter and the start of the line (li), etc.

Each line of the document is represented as a vector of these parameter values, which is processed by a Bayesian classification method. The system thus extracts the isolated formulae from the text. For embedded formulae, the system relies on an estimation of the baseline and the meanline.

• Other approaches address the particular case of matrices and vectors. For example, Okamoto and Twaakyondo [20] presented an approach based on locating a pair of delimiters of the same size and the same type. A horizontal projection is applied to the delimited zone; if the projection contains several lines, a vertical projection is applied in order to separate the columns. Every element of the matrix can thus be identified separately.

• The model of Toumit, Garcia-Salicetti and Emptoz [18].
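
The projection-based separation of matrix elements mentioned for [20] can be illustrated with the sketch below. It is not the authors' implementation: it assumes that the zone between the two delimiters is already available as a binary image (1 for ink, 0 for background) and that the columns are aligned across all rows; `runs` and `split_matrix` are hypothetical names.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed input: binary image, 1 = ink, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def split_matrix(zone: Image) -> List[List[Tuple[int, int, int, int]]]:
    """Return the cells of the zone as (row_start, row_end, col_start, col_end)."""
    # Horizontal projection: amount of ink on each pixel row.
    row_spans = runs([sum(row) for row in zone])
    if len(row_spans) <= 1:
        # A single line: no vertical projection is applied.
        width = len(zone[0]) if zone else 0
        return [[(r0, r1, 0, width)] for (r0, r1) in row_spans]
    # Several lines: the vertical projection separates the columns.
    col_spans = runs([sum(col) for col in zip(*zone)])
    return [[(r0, r1, c0, c1) for (c0, c1) in col_spans] for (r0, r1) in row_spans]
```

Each returned cell can then be passed to the symbol recognizer separately.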

Results of segmentation

The segmentation phase must also provide complementary information about the symbols, such as:

The relative size of each symbol: it is quantified by a value that is stored as an attribute of the recognized symbol.

The baseline: the symbols are positioned relative to this line, which indicates their alignment. This information should be provided by the character recognition system; otherwise it can be computed from several parameters (e.g. the font).

F. Mathematical object structure recognition

The segmentation phase separates the different mathematical objects of the document and then recognizes the characters of each mathematical object. The next phase consists of grouping the characters to form the complex mathematical objects.

Recognizing the structure of a mathematical object requires two phases:

• Identification of the spatial and logical relationships between symbols;

• Structural analysis of the mathematical objects with respect to the relationships already identified.

G. Symbols’ relationships identification

Most of the methods used are based on a static labeling of the links between symbols. The symbols are converted into a set of bounding boxes, and several parameters allow the different relationships between pairs of bounding boxes to be detected.

The reader can consult reference [7] for an example of these methods. For each pair of bounding boxes, some characteristics are measured (the height ratio and the vertical shift) and a label is assigned to each link.

The identification takes into account only the positions of the symbols, that is, their bounding boxes, and not the identity of the boxes or even the reference line.
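
A static labeling of this kind can be sketched as follows. The two measurements (height ratio and vertical shift) come from the text; the thresholds `size_ratio` and `shift_ratio`, the label names and the convention that y grows downward are assumptions made for the illustration and are not taken from [7].

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward


def label_link(left: Box, right: Box,
               size_ratio: float = 0.8, shift_ratio: float = 0.25) -> str:
    """Label the link from `left` to the box `right` located to its right.

    Assumes `left` has a strictly positive height.
    """
    h_left = left[3] - left[1]
    h_right = right[3] - right[1]
    height_ratio = h_right / h_left
    # Vertical shift between the box centres, normalised by the left box height.
    centre_left = (left[1] + left[3]) / 2
    centre_right = (right[1] + right[3]) / 2
    shift = (centre_right - centre_left) / h_left
    if height_ratio < size_ratio and shift < -shift_ratio:
        return "superscript"
    if height_ratio < size_ratio and shift > shift_ratio:
        return "subscript"
    return "inline"
```

Because only the geometry of the boxes is used, such a labeling shares the limitation stated above: it ignores the identity of the symbols and their reference line.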

Other approaches make it possible to distinguish legal from illegal configurations of elements and to eliminate the latter (sy and y*).

a) Structural analysis

a. Structural methods

Sonfelein [12] noted that all representation structures fall into three families:

• Strings: they represent the succession and the composition of text elements, or even the structure of the document layout.

• Trees: a hierarchy of nodes that all have the same structure (a value and two links, one to the left and one to the right).

• Graphs: the representation best suited to mathematical formulae, since it can express different types of relationships between the components of a formula. A mathematical formula is therefore a graph {N, A}, where N is the set of nodes and A is the set of edges between pairs of nodes. Each time a sub-expression is recognized, the corresponding sub-graph is replaced in the graph.

Among the structural methods, one can note:

i. Syntactic methods

Syntactic methods rely on the classification of grammars [5]. A strong similarity can be noted: languages in the sense of Chomsky [5] are based on a precise alphabet and on production rules, while mathematical objects can be decomposed into a set of primitive elements (symbols), have a recursive structure and follow a well-defined syntax. The grammar rules are used to group symbols and to define the meaning of a group of symbols.

Coordinate grammar:

The reader can consult R. H. Anderson's PhD thesis (1968) for a more detailed example of this method. The rules of this grammar-based method are organized around an operator and define the areas in which its operands are searched for. This grammar operates on symbols instead of strings, which are the basis of classical grammars. It relies on a top-down analysis that uses a set of production rules to subdivide a set of symbols into subsets. Its major disadvantage is the analysis speed. Several extensions exist:

• An extension that proposes a system combining bottom-up and top-down analysis. In cases of uncertainty, the system provides a list of candidates with their probabilities.

• Another extension that can handle errors. The method is based on an attributed coordinate grammar.

Structure specification schemes

Chang uses these schemes to recognize the structure of the mathematical object; his method is based on the existence of operators which subdivide the scheme into subsets. Subdivision rules, one for each operator, determine the spatial links between the different elements. The order in which the operators are processed is determined by operator precedence.

This method is limited since it deals only with explicit operators.

ii. Logical methods

Logical methods are based on a logical modeling of knowledge:

• Mathematical knowledge: the syntax of mathematical expressions;

• Lexical knowledge;

• Knowledge about the process of writing;

• Procedural knowledge;

• Etc.

The interpretation of an expression is based on its structural description. This gives the structural recognition of the expression priority over the recognition of its constituent symbols.

iii. Procedural methods

Procedural methods gather the different kinds of knowledge into procedures, which justifies their name. The advantage of this approach is its speed; its disadvantage is that it is difficult to reuse and to extend to other notations. The sources of noise are many [14]. After the symbol recognition stage, an unordered list of symbols is built. The procedures group symbols into sub-expressions, and each recognized sub-expression is replaced by its bounding box. The grouping rules are simple:

A horizontal line can be classified as long or short. A short bar with symbols above and below is considered a fraction bar; if no symbol is detected either above or below, the sub-expression is a minus sign, and so on.
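
The rule for short horizontal bars can be written as a small procedure. This is only a sketch of the two cases stated above: the box layout, the function names and the "undecided" outcome for the cases the text does not detail are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True when the x-extents of the two boxes intersect."""
    return a[0] < b[2] and b[0] < a[2]


def classify_bar(bar: Box, symbols: List[Box]) -> str:
    """Classify a short horizontal bar from the symbols found above and below it."""
    above = any(overlaps_horizontally(bar, s) and s[3] <= bar[1] for s in symbols)
    below = any(overlaps_horizontally(bar, s) and s[1] >= bar[3] for s in symbols)
    if above and below:
        return "fraction bar"
    if not above and not below:
        return "minus sign"
    # Symbols on one side only: this case is not specified in the text.
    return "undecided"
```

Rules of this kind are fast to evaluate, but each new notation requires a new procedure, which illustrates the reuse problem mentioned above.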

Okamoto [13] considers that every recognition approach, including those based on procedural rules, implicitly or explicitly defines a syntax for the expressions to be recognized. He finds syntactic approaches untenable, because the wide variety of possible expressions makes it impossible to define a complete syntax. It is then more practical to use an implicit syntax defined by procedural rules than an explicit syntax defined by grammar rules.

Okamoto's system [13] combines procedural rules with the projection cutting approach.

iv. Projection cutting method

The projection cutting method is used in several document image analysis applications. In the work of Okamoto [13], vertical and horizontal projections are applied alternately and recursively until no further subdivision is possible. The analysis is very inexpensive. The result is a tree of spatial relationships between the symbols that make up the notation. Other processes can complement this treatment in order to recognize the families of notations that the method cannot analyze. In the case of handwritten notations, the method quickly shows its limits.
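
The alternating, recursive use of projections can be sketched as follows. This is an illustration of the general idea and not the system of [13]: it works on symbol bounding boxes rather than on pixels, and the names `split` and `xy_cut` and the labels "horizontal", "vertical" and "unsplit" are assumptions.

```python
from typing import List, Tuple, Union

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
Tree = Union[Box, Tuple[str, list]]


def split(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by a blank gap in the projection onto `axis` (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups, reach = [[ordered[0]]], ordered[0][axis + 2]
    for box in ordered[1:]:
        if box[axis] > reach:      # blank gap in the projection: cut here
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], axis: int = 0, failed: bool = False) -> Tree:
    """Cut alternately along x and y until no subdivision is possible."""
    if len(boxes) == 1:
        return boxes[0]
    groups = split(boxes, axis)
    if len(groups) == 1:
        if failed:                 # neither direction can cut this group
            return ("unsplit", boxes)
        return xy_cut(boxes, 1 - axis, failed=True)
    label = "horizontal" if axis == 0 else "vertical"
    return (label, [xy_cut(group, 1 - axis) for group in groups])
```

For an expression such as a + b/c, the first cut separates a, + and the fraction horizontally, and the second cut separates the numerator, the bar and the denominator vertically. Groups returned as "unsplit" (for example overlapping boxes) are the cases that require the complementary processes mentioned above.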

v. Graph rewriting

Graph rewriting is a general technique that represents all the information in an attributed graph, which is updated by applying rewriting rules.

The symbols of the expression are represented by nodes whose attributes hold the spatial coordinates of the symbols.

The rewriting rules build, correct and edit the arcs (edges) between the nodes. The attributes of the arcs hold information about the operators and their precedence, and determine how a set of nodes can be grouped into a sub-expression.

The rewriting rules also specify how the sub-graph corresponding to a recognized sub-expression is replaced during rewriting. The result is a single node whose attributes carry information about the meaning of the initial expression.

Graph rewriting is a promising approach for the recognition of mathematical expressions. The process relies on a robust formalism with theoretical foundations and consists of four steps:

• Step 1: build the arcs between the nodes that represent the symbols of the mathematical notation.

• Step 2: apply mathematical conventions to remove or correct ambiguous arcs.

• Step 3: group the nodes into sub-expressions following the precedence of the operators.

• Step 4: interpret the sub-expressions.

The major disadvantage is the detection of errors.
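
The rewriting loop can be illustrated on a deliberately reduced case. The sketch below assumes a single horizontal line of symbols, a hypothetical precedence table with only `+` and `*`, and a graph stored as a dictionary of "right-of" arcs; step 2 (removal of ambiguous arcs) is omitted. It is not the formalism of any cited system.

```python
PRECEDENCE = {"*": 2, "+": 1}  # hypothetical operator table


def build_arcs(nodes):
    """Step 1: add a 'right-of' arc between horizontally adjacent symbols."""
    ordered = sorted(nodes.values(), key=lambda n: n["x"])
    return {a["id"]: b["id"] for a, b in zip(ordered, ordered[1:])}


def is_operand(node):
    return node["label"] not in PRECEDENCE


def rewrite_once(nodes, arcs):
    """Steps 3 and 4: replace one 'operand operator operand' sub-graph by a single node."""
    back = {dst: src for src, dst in arcs.items()}
    candidates = [n for n in nodes.values()
                  if n["label"] in PRECEDENCE and n["id"] in arcs and n["id"] in back
                  and is_operand(nodes[back[n["id"]]]) and is_operand(nodes[arcs[n["id"]]])]
    if not candidates:
        return False
    # Highest precedence first; leftmost operator on ties.
    op = max(candidates, key=lambda n: (PRECEDENCE[n["label"]], -n["x"]))
    lhs, rhs = nodes[back[op["id"]]], nodes[arcs[op["id"]]]
    new_id = max(nodes) + 1
    merged = {"id": new_id, "label": "expr", "x": lhs["x"],
              "meaning": (op["label"], lhs["meaning"], rhs["meaning"])}
    # Reconnect the neighbours of the replaced sub-graph to the new node.
    before, after = back.get(lhs["id"]), arcs.get(rhs["id"])
    for old in (lhs["id"], op["id"], rhs["id"]):
        nodes.pop(old)
        arcs.pop(old, None)
    if before is not None:
        arcs[before] = new_id
    if after is not None:
        arcs[new_id] = after
    nodes[new_id] = merged
    return True


def recognize(symbols):
    """`symbols` is a list of (label, x) pairs; returns the meanings of the remaining nodes."""
    nodes = {i: {"id": i, "label": s, "x": x, "meaning": s}
             for i, (s, x) in enumerate(symbols)}
    arcs = build_arcs(nodes)
    while rewrite_once(nodes, arcs):
        pass
    return [n["meaning"] for n in sorted(nodes.values(), key=lambda n: n["x"])]
```

When the input is malformed, the loop simply stops with several nodes left and gives no indication of where the problem lies, which is consistent with the weakness in error detection noted above.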

vi. Data driven and knowledge driven modules

The approach of Faure and Wang [10] builds a system for the recognition of mathematical objects from data-driven and knowledge-driven modules. The system is a collection of independent modules that communicate through a shared memory containing the relational tree to be updated.

It is based on four fundamental modules:

• Data acquisition module.

• Data-driven segmentation module, which:

Builds the initial relational tree.

Does not take into account the order in which the characters were written.

Does not use contextual information at this stage.

Uses projections onto the X and Y axes for symbol recognition.

When the projections fail, attempts a mask-removal operation. The first phase is to look for the mask: special routines search for constructs such as square roots and fraction bars. When this fails, any long lines are sought, and the other symbols are analyzed using X or Y projections.

• Knowledge-driven segmentation module: this module uses specific knowledge to correct the relational tree. This knowledge concerns lexical and syntactic rules, symbol attachment, and symbol shapes.

• Once the nodes have been identified, a complex procedure updates the rest of the tree in accordance with the corrected nodes.

• Labeling module: this module labels the spatial relationships between the elements of a handwritten line.

Symbol recognition modules and a complete syntactic analysis process still have to be added.

This system divides the recognition problem into sub-problems and has shown its ability to handle handwritten input.

The organization of the system as a set of independent modules is an interesting structure, which eases the design, programming and debugging of the different rules.

vii. Stochastic grammar

A stochastic grammar associates a probability with each production rule. It seems to be a better approach for the recognition of mathematical objects. The major problem with this method is that the font must be recognized and the reference line must not be slanted.

P. Chou's system [6] uses a stochastic grammar to deal with the noise in an equation. The result is optimal: the noise, the ambiguity and the symbols are handled well.

viii. Multi-dimensional grammar

Used by Kernighan and L. L. Cherry [19]: the bounding boxes representing the symbols are linked by multi-dimensional relationships.

ix. Methods comparison

Several factors must be considered when comparing the existing methods:

• Diversity of the data sources to process (on-line data, off-line data).

• Variety of recognition methods:

• Methods that integrate the segmentation process or not.

• Methods that require isolated formulas or not.

• Diversity of mathematical notations.

b. Structural recognition process

Structural recognition requires three important phases:

• Lexical analysis, which transforms the symbols of the notation into a stream of lexical units.

• Geometrical analysis, which builds the graph of the notation.

• Syntactic analysis, which analyzes the graph in order to build the abstract syntax tree.

Lexical analysis: extracting lexical parameters

Lexical analysis converts the input stream of characters into an output stream of lexical units that can be processed by the syntactic analyzer.

This step hides the physical representation of the lexical units from the syntactic analyzer and gathers the information about each unit in attributes such as the value of the recognized symbol, the graphic details of the symbol (coordinates of its bounding box), the reference line, and the relative size of the symbol.

Geometry Analysis: Graph building

The data structure best suited to the particular case of mathematical objects is the graph. Two alternatives are possible:

• Modeling symbols as nodes, so that the edges represent relations.

• Modeling symbols as edges, so that the nodes represent relations.

Graph building relies on two fundamental concepts, proximity and direction, and combines the results of the segmentation phase and/or the lexical analysis.

Lavirotte's system [7] divides the plane into 9 regions in order to better model the different relationships between symbols: top left, top right, top (above), left, right, included (inside), bottom left, bottom right, and bottom. A symbol appearing in one of these regions is a candidate for a link, whose creation is governed by several criteria such as the size of the symbol, the alignment of the bounding boxes, the alignment of the character reference lines, etc.

The lexical type can forbid some links.
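
The division of the plane into 9 regions can be sketched as below. The region names follow the list given above; deciding the region from the centre of the other symbol's bounding box is an assumption made for this illustration and is not necessarily the criterion used in [7].

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward


def region(ref: Box, other: Box) -> str:
    """Name the region around `ref` that contains the centre of `other`."""
    cx = (other[0] + other[2]) / 2
    cy = (other[1] + other[3]) / 2
    horizontal = "left" if cx < ref[0] else "right" if cx > ref[2] else ""
    vertical = "top" if cy < ref[1] else "bottom" if cy > ref[3] else ""
    if not horizontal and not vertical:
        return "inside"
    return " ".join(part for part in (vertical, horizontal) if part)
```

A candidate link found this way would still have to pass the other criteria listed above (size, alignment, lexical type) before being added to the graph.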

Restrictions of graph building

Limiting the links between symbols can lead to analysis failures. In the handwritten case:

• Difficulty detecting whether a symbol is at the top right, at the bottom right, or aligned.

• Symbols too distant from each other, so that no link is built between them in the graph.

• Symbols without any logical link that are too close to each other.

Syntactical analysis: Grammar introduction

The grammar rules must combine the mathematical symbols correctly and define the meaning of the result of this grouping. The grammar must cover two aspects: a combinatorial one, since a formula is a series of characters, and a geometric one, since a formula is spatial. Attributed graph grammars are a type of grammar well suited to mathematical formulae. They consist of rewriting a graph or a sub-graph and iterating this process. The grammar rules specify the type of sub-graph sought and how the rewriting process replaces it.

Each grammar symbol has a set of attributes. There are two types of attributes:

• Synthesized attributes, whose values are computed from the attributes of the child nodes in the abstract syntax tree.

• Inherited attributes, whose values are computed from the attributes of the sibling and parent nodes.

The attributes can carry semantic information that is passed up to the root of the abstract syntax tree (synthesized attributes) or down to the leaves (inherited attributes). Using attributes, semantic information can thus be transferred between any two points of the abstract syntax tree in a formal way.
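
The two kinds of attributes can be illustrated on a small abstract syntax tree. In the sketch below, which is an assumption-laden example rather than a grammar from the literature, `level` is an inherited attribute (the script nesting level, passed from parent to child) and `latex` is a synthesized attribute (a LaTeX string built from the children). The node kinds "symbol", "sum" and "power" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                                   # "symbol", "sum" or "power"
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    level: int = 0                              # inherited attribute
    latex: str = ""                             # synthesized attribute


def evaluate(node: Node, level: int = 0) -> str:
    """Set the inherited attribute top-down and compute the synthesized one bottom-up."""
    node.level = level                          # inherited: received from the parent
    if node.kind == "symbol":
        node.latex = node.text
    elif node.kind == "sum":
        left, right = (evaluate(child, level) for child in node.children)
        node.latex = f"{left} + {right}"
    elif node.kind == "power":
        base = evaluate(node.children[0], level)
        # The exponent is written one script level deeper than its base.
        exponent = evaluate(node.children[1], level + 1)
        node.latex = f"{base}^{{{exponent}}}"
    else:
        raise ValueError(f"unknown node kind: {node.kind}")
    return node.latex
```

For example, the tree for x squared plus y evaluates to the string `x^{2} + y`, and the node for the exponent 2 ends up with `level` equal to 1.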

IV. SUMMARY

The robustness and the reliability of a recognition system are assessed according to several characteristics, such as:

• Reuse: most approaches are specific to a subset of mathematical expressions (e.g. the Fateman system for integral recognition cannot be reused for other types of notation).

• Evolution: the recognition system must be able to adapt to variations in mathematical notation and, possibly, to the appearance of new notations.

• Extraction of the meaning of the mathematical object: the structural recognition must produce an abstract syntax tree from the geometrical data and attach to this tree the semantic information needed to exploit the result. This information is obtained from the textual context from which the expression was extracted, from standard conventions, …

The main difficulty of mathematical expression recognition is that it lies at the intersection of several research areas:

• Data acquisition and pre-processing methods,

• Physical and logical segmentation methods,

• Symbol recognition methods,

• Approaches and methods for recognizing the structure of mathematical expressions.

A recognition system is complete if it allows:

• Using a symbol recognition process;

• Handling the errors and uncertainty of the symbol recognition phase.

V. CONCLUSION

This study aims to define another representation method. It would be interesting to integrate the full recognition process. Comparing the existing methods requires taking several factors into account, such as:

• The variety of recognition methods, of mathematical notations and of data sources (on-line or off-line).

• Whether or not the methods integrate the segmentation process, and whether or not they require isolated expressions.

REFERENCES

[1] Luca Padovani, “On the Roles of LaTeX and MathML in Encoding and Processing Mathematical Expressions”, A. Asperti, B. Buchberger, J. H. Davenport (Eds.), MKM 2003, LNCS 2594, pp. 66-79, 2003.

[2] M. Kohlhase, “Mathematical Objects (Module MOBJ)”, OmDoc, LNAI 4180, pp. 107-120, 2006.

[3] Jurgen Stuber and Mark Van Brand, “Extracting Mathematical Semantics from LaTeX Documents”, LNCSI 2901, pp. 160-173, 2003.

[4] B. F. Talla and G. E. Koumou, “Une approche formelle de description et de manipulation des objets structures mathématiques”, November 2005, special issue CARI’04, Revue ARIMA.

[5] G. Martin, “Computer input/output of mathematic expressions”, 1971.

[6] P. Chou, “Recognition of equations using a 2 dimensional stochastic context free grammar”, in Proceedings of the SPIE Conference on Visual Communication and Image Processing IV, pp. 852-863, Philadelphia, USA, November 1989.

[7] S. Lavirotte, “Reconnaissance structurelle des formules mathématiques typographiées et manuscrites”, PhD Thesis, 2000.

[8] J. Y. Toumit, H. Emptoz, “From the segmentation to the reading of a mathematical document”, GKPO’98, Machine Graphics & Vision.

[9] A. Belaïd, A. Kacem, M. Ben Ahmed, “Formulas extraction from mathematical documents”, RFIA’00, France, 2000.

[10] Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, “Electron spectroscopy studies on magneto-optical media and plastic substrate interface”, IEEE Transl. J. Magn. Japan, vol. 2, pp. 740–741, August 1987 [Digests 9th Annual Conference Magnetics Japan, p. 301, 1982].

[11] Jianming Jin, Xionghu Han, Qingren Wang, “Mathematical Formulas Extraction”, Institute of Machine Intelligence, Nankai University, Tianjin, China, 300071.

[12] A. Belaïd, “Panorama des méthodes structurelles en analyse et reconnaissance des documents”, Y. Chenevoy, F. Parmentier (1997).

[13] M. Okamoto and B. Miao, “Recognition of Mathematical expressions by using the layout structure of symbols”, in Proceedings of the First International Conference on Document Analysis and Recognition, pp. 242-250, Saint Malo, France, 1991.

[14] H. Lee and M. Lee, “Understanding mathematical expressions using procedure-oriented transformation”, Pattern Recognition 27, 3 (1994), pp. 447-457.

[15] Lee, Wang, “Design of mathematical expression recognition system”, ICDAR’95, Japan, 1995, pp. 1084-1087.

[16] Fateman, “Optical Character recognition and parsing of typeset Mathematics”, J. of Visual Commun. and Image Representation, vol. 7, no. 1 (March 1996), pp. 2-15.

[17] Fateman, “How to find Maths on the scanned page”, September 1997.

[18] Toumit, Garcia-Salicetti and H. Emptoz, “A hierarchical and recursive model of mathematical expressions for automatic reading of mathematic documents”, in Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR), pp. 119-122, Bangalore, India, September 1999, IEEE Computer Society Press.

[19] Kernighan, L. Cherry, “A system for typesetting mathematics”.

[20] M. Okamoto and H. M. Twaakyondo, “A Structure Analysis and recognition of mathematical expressions”, ICDAR’95, Canada, 1995, pp. 430-437.