Abstract:
An automated system that recognizes questionnaire responses and enters them into a database would be useful in many fields, because manual data entry is time consuming. The purpose of this research is to automate that manual data entry process. The research developed a new method for clustering printed and handwritten words and a character recognition method for identifying each character of a handwritten word. Such a system must be able to separate the printed words of a questionnaire from the handwritten answers and then recognize each character in a handwritten word. A horizontal projection profile was used to segment the scanned questionnaire into lines, and a vertical projection profile was used to segment each line into words. Two types of data, characters and words, were collected using a questionnaire. The characters comprised the 26 English upper case letters, the 10 numeric characters, and 3 main symbols: dot (.), at (@), and dash (-). The words comprised printed words and handwritten words. The target population was the students of the Faculty of Science, University of Colombo, with a population size of 2000. Data were collected by stratified sampling. A sample size of 300 was chosen, with a sampling margin of error of 0.05. Sixteen strata were created based on gender, stream of study, and academic year. Six features were identified for clustering the printed and handwritten words: height, pixel density, pixel distribution, vertical projection variance, major vertical edge, and major horizontal projection profile. The results showed that agglomerative hierarchical clustering provides the highest recall accuracy, 98%. Complete linkage with Euclidean distance maximized the cophenetic correlation coefficient, at 0.8874. Once the handwritten words were identified, a vertical projection profile was used to separate the characters of each word.
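The line and word segmentation described above can be sketched with projection profiles. The following is a minimal illustration, not the authors' implementation: it assumes a binarized page stored as rows of 0/1 pixels (1 = ink), and the names `projection_profile`, `find_runs`, `segment_words`, and the `word_gap` threshold are hypothetical choices made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink pixel, 0 = background


def projection_profile(image: Image, axis: str) -> List[int]:
    """Count ink pixels per row ('horizontal') or per column ('vertical')."""
    if axis == "horizontal":
        return [sum(row) for row in image]
    if axis == "vertical":
        return [sum(col) for col in zip(*image)]
    raise ValueError("axis must be 'horizontal' or 'vertical'")


def find_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of stretches that contain ink.

    Stretches separated by fewer than min_gap empty positions are merged,
    so narrow gaps (e.g. between letters) do not split a word.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_words(image: Image, word_gap: int = 2) -> List[Tuple[int, int, int, int]]:
    """Return (top, bottom, left, right) boxes: lines first, then words per line."""
    boxes = []
    for top, bottom in find_runs(projection_profile(image, "horizontal")):
        line = image[top:bottom]
        columns = projection_profile(line, "vertical")
        for left, right in find_runs(columns, min_gap=word_gap):
            boxes.append((top, bottom, left, right))
    return boxes
```

Under the same assumptions, calling `find_runs` on the vertical profile of a single word box with `min_gap=1` would give candidate character boundaries, which mirrors the character separation step mentioned at the end of the paragraph.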
Sixteen partial densities were calculated for each character as features. Assuming that a large volume of data follows a Gaussian distribution, a probabilistic neural network was created with an input layer that takes the 16 partial densities as variables and an output layer that yields 39 classes: the 26 English upper case characters, 10 numerical characters, and 3 symbols. The system achieved a recall accuracy of 71.4% with a spread of 14. The major drawback of the system was the difficulty of separating the number 0 from the character O, the number 1 from the character I, the number 2 from the character Z, and the number 5 from the character S, which reduced the character recognition accuracy. Still, the system provides an efficient solution for automating questionnaire data entry.
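The character classifier can be illustrated with a small probabilistic neural network over 16 zone densities. This is a sketch under stated assumptions rather than the system described here: the abstract does not say how the 16 partial densities are formed, so a 4 x 4 grid of ink fractions is assumed, and the pattern layer uses the common Gaussian Parzen-window kernel exp(-d^2 / (2 * spread^2)). The names `zone_densities` and `ProbabilisticNeuralNetwork` are hypothetical, and the spread value here is not necessarily on the same scale as the reported spread of 14.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Glyph = List[List[int]]  # assumed representation: 1 = ink pixel, 0 = background


def zone_densities(glyph: Glyph, grid: int = 4) -> List[float]:
    """Ink fraction in each cell of a grid x grid partition (16 values for grid=4)."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for gr in range(grid):
        r0, r1 = gr * height // grid, (gr + 1) * height // grid
        for gc in range(grid):
            c0, c1 = gc * width // grid, (gc + 1) * width // grid
            area = (r1 - r0) * (c1 - c0)
            ink = sum(glyph[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(ink / area if area else 0.0)
    return features


class ProbabilisticNeuralNetwork:
    """Pattern layer stores every training vector; summation layer averages
    Gaussian kernel activations per class; output layer picks the largest."""

    def __init__(self, spread: float) -> None:
        if spread <= 0:
            raise ValueError("spread must be positive")
        self.spread = spread
        self.patterns: Dict[str, List[List[float]]] = defaultdict(list)

    def fit(self, samples: Sequence[Sequence[float]], labels: Sequence[str]):
        for vector, label in zip(samples, labels):
            self.patterns[label].append(list(vector))
        return self

    def class_scores(self, x: Sequence[float]) -> Dict[str, float]:
        scores = {}
        for label, stored in self.patterns.items():
            total = 0.0
            for pattern in stored:
                sq_dist = sum((a - b) ** 2 for a, b in zip(x, pattern))
                total += math.exp(-sq_dist / (2 * self.spread ** 2))
            scores[label] = total / len(stored)
        return scores

    def predict(self, x: Sequence[float]) -> str:
        scores = self.class_scores(x)
        return max(scores, key=scores.get)
```

With such a classifier, glyph pairs whose zone densities are nearly identical would receive nearly equal class scores, which is consistent with the confusions between digits and similar-looking letters reported above.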
Keywords:
Gaussian distribution; correlation methods; data acquisition; handwritten character recognition; neural nets; pattern clustering; sampling methods; text analysis; Cophenetic correlation coefficient; English upper case characters; Euclidean distances; agglomerative hierarchical clustering method; automated data entry process; automated response recognition system; character recognition method; data collection; distance linkage; handwritten answers; handwritten word character identification; major horizontal projection profile; major vertical edge; numerical characters; pixel density; pixel distribution; printed word; probabilistic neural network; recall accuracy; scanned questionnaire line segmentation; stratified sampling; symbols; vertical projection profile; vertical projection variance; word height; word segmentation; Accuracy; Computers; Handwriting recognition; Image recognition; Image segmentation; Manuals; White spaces; Clustering; Text recognition