Abstract:
This paper addresses the recognition of Arabic mathematical formulas extracted from scanned images of clearly printed documents. The proposed system works in two main stages: symbol recognition and structural analysis of the formula. In the first stage, the system combines several statistical features (run length, Hu and Zernike moments, bi-level co-occurrence, and the proportion of white pixels) with the instance-based classifier K*, and achieves high accuracy on isolated mathematical symbols. In the second stage, the system applies a combined top-down and bottom-up parsing scheme based on operator dominance. A coordinate grammar defines a set of replacement rules that draw on the results of symbol recognition and symbol arrangement analysis. The recognition and parsing modules interact closely, so context information collected during structural analysis can be used to revise symbol hypotheses instead of assuming that symbol recognition is perfect. The system produces MathML output, which can readily be passed to computer algebra systems for further processing. The syntax-directed recognition system described here has been tested on many types of formulas with satisfactory results: 91% of formulas are correctly recognized.
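To make the idea of operator dominance concrete, the following minimal sketch parses a set of already-recognized symbols into MathML for the single case of a fraction bar. It is an illustration under stated assumptions, not the system's actual grammar: the `Symbol` class, the label `"frac"` for a fraction bar, the "widest bar dominates" rule, and the `rtl` flag for right-to-left reading order are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Symbol:
    """A recognized symbol with its bounding box (hypothetical structure)."""
    label: str   # recognized class, e.g. "a", "2", "+", or "frac" for a fraction bar
    x0: float    # left edge
    y0: float    # top edge (image coordinates: y grows downward)
    x1: float    # right edge
    y1: float    # bottom edge

    @property
    def cx(self) -> float:
        return (self.x0 + self.x1) / 2

    @property
    def cy(self) -> float:
        return (self.y0 + self.y1) / 2


OPERATORS = {"+", "-", "="}


def leaf(sym: Symbol) -> str:
    """Map a single symbol to a MathML token element."""
    if sym.label.isdigit():
        return f"<mn>{sym.label}</mn>"
    if sym.label in OPERATORS:
        return f"<mo>{sym.label}</mo>"
    return f"<mi>{sym.label}</mi>"


def parse(symbols: List[Symbol], rtl: bool = False) -> str:
    """Top-down split on the dominant fraction bar, then recurse on each region."""
    bars = [s for s in symbols if s.label == "frac"]
    if not bars:
        # No dominant operator left: emit the symbols in reading order.
        ordered = sorted(symbols, key=lambda s: s.cx, reverse=rtl)
        return "".join(leaf(s) for s in ordered)

    # Assumption: the widest bar dominates every symbol within its horizontal span.
    bar = max(bars, key=lambda s: s.x1 - s.x0)
    others = [s for s in symbols if s is not bar]
    inside = [s for s in others if bar.x0 <= s.cx <= bar.x1]
    numerator = [s for s in inside if s.cy < bar.cy]
    denominator = [s for s in inside if s.cy >= bar.cy]
    left = [s for s in others if s.cx < bar.x0]
    right = [s for s in others if s.cx > bar.x1]

    frac = (
        f"<mfrac><mrow>{parse(numerator, rtl)}</mrow>"
        f"<mrow>{parse(denominator, rtl)}</mrow></mfrac>"
    )
    first, last = (right, left) if rtl else (left, right)
    return parse(first, rtl) + frac + parse(last, rtl)


def to_mathml(symbols: List[Symbol], rtl: bool = False) -> str:
    """Wrap the parsed expression in a MathML root element."""
    body = parse(symbols, rtl)
    return f'<math xmlns="http://www.w3.org/1998/Math/MathML">{body}</math>'


# Example layout: "a + b" above a fraction bar, "2" below it.
example = [
    Symbol("a", 0, 0, 10, 10),
    Symbol("+", 12, 0, 22, 10),
    Symbol("b", 24, 0, 34, 10),
    Symbol("frac", 0, 12, 34, 14),
    Symbol("2", 12, 16, 22, 26),
]
```

Calling `to_mathml(example)` yields an `<mfrac>` whose numerator holds `a + b` and whose denominator holds `2`. Passing `rtl=True` emits the symbols of each region from right to left instead. A full coordinate grammar would add rules for superscripts, subscripts, roots, and other operators.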