Title :
Word classification in bilingual printed documents
Author :
Haboubi, Sofiene ; Maddouri, Samia ; Amiri, Hamid
Author_Institution :
Image & Inf. Technologies Lab., Nat. Eng. Sch. of Tunis, Tunis, Tunisia
Abstract :
In this paper we propose a method for identifying Arabic words in printed documents that mix Arabic and Latin scripts. The method uses statistical and geometric analysis to segment a printed document into words. Structural features then describe each extracted word; these include descenders (jambs), diacritical points, connected components, and ascenders (hamps). From these features we build a descriptive vector for each word. Neural networks classify each extracted word into one of two classes, Arabic or Latin. We present the classification results and discuss possible improvements.
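As an illustration of one structural feature named in the abstract, the sketch below counts connected components in a binarized word image and flags very small ones as candidate diacritical points. It is not the authors' implementation: the function names, the 8-connectivity choice, the `dot_max_pixels` threshold, and the two-element feature vector are all assumptions made for this example.

```python
from collections import deque


def connected_components(image):
    """Return a list of pixel lists, one per 8-connected ink component.

    `image` is a list of rows of 0/1 values (1 = ink). Assumed input format.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def word_features(image, dot_max_pixels=4):
    """Hypothetical feature vector: [component count, small-component count].

    Components with at most `dot_max_pixels` pixels are treated as candidate
    diacritical points; the threshold is an arbitrary illustrative value.
    """
    comps = connected_components(image)
    small = sum(1 for pixels in comps if len(pixels) <= dot_max_pixels)
    return [len(comps), small]


# Toy word image: one long stroke with a single isolated dot above it.
toy_word = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
]
features = word_features(toy_word)
```

In a full pipeline, a vector like this would be extended with the other features the abstract lists (ascenders, descenders) before being passed to a classifier; those steps are not sketched here.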
Keywords :
geometry; natural language processing; neural nets; pattern classification; statistical analysis; text analysis; Arabic words; Latin scripts; bilingual printed documents; connected components; geometric analysis; neural networks; printed documents; structural features; word classification; words extraction; Character recognition; Feature extraction; Gabor filters; Optical character recognition software; Writing; Language identification; word extraction
Conference_Title :
Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), 2012 6th International Conference on
Conference_Location :
Sousse
Print_ISBN :
978-1-4673-1657-6
DOI :
10.1109/SETIT.2012.6481963