DocumentCode :
1455288
Title :
Empirical performance evaluation methodology and its application to page segmentation algorithms
Author :
Mao, Song ; Kanungo, Tapas
Author_Institution :
Center for Autom. Res., Maryland Univ., College Park, MD, USA
Volume :
23
Issue :
3
fYear :
2001
fDate :
3/1/2001 12:00:00 AM
Firstpage :
242
Lastpage :
256
Abstract :
While numerous page segmentation algorithms have been proposed in the literature, there is a lack of comparative evaluation of these algorithms. In the existing performance evaluation methods, two crucial components are usually missing: 1) automatic training of algorithms with free parameters and 2) statistical and error analysis of experimental results. We use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: 1) first, we create mutually exclusive training and test data sets with groundtruth, 2) we then select a meaningful and computable performance metric, 3) an optimization procedure is then used to search automatically for the optimal parameter values of the segmentation algorithms on the training data set, 4) the segmentation algorithms are then evaluated on the test data set, and, finally, 5) a statistical and error analysis is performed to give the statistical significance of the experimental results. In particular, instead of the ad hoc and manual approach typically used in the literature for training algorithms, we pose the automatic training of algorithms as an optimization problem and use the simplex algorithm to search for the optimal parameter values. A paired-model statistical analysis and an error analysis are then conducted to provide confidence intervals for the experimental results of the algorithms. This methodology is applied to the evaluation of five page segmentation algorithms, of which three are representative research algorithms and the other two are well-known commercial products, on 978 images from the University of Washington III data set. It is found that the performance indices of the Voronoi, Docstrum, and Caere segmentation algorithms are not significantly different from each other, but they are significantly better than that of ScanSoft's segmentation algorithm, which, in turn, is significantly better than that of X-Y cut.
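The paired comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-image score lists, and the use of a normal approximation for the interval are all assumptions made for illustration. The idea is that two algorithms are scored on the same test images, the per-image differences are computed, and a confidence interval on the mean difference indicates whether the algorithms differ significantly.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_confidence_interval(scores_a, scores_b, confidence=0.95):
    """Return (mean_diff, low, high) for the per-image differences a - b.

    Hypothetical helper: scores_a and scores_b are performance scores of
    two algorithms measured on the same test images, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_diff = mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = stdev(diffs) / math.sqrt(n)
    # Assumption: normal approximation, reasonable for large test sets;
    # a Student t quantile would be more appropriate for small n.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean_diff, mean_diff - z * std_err, mean_diff + z * std_err


def significantly_different(scores_a, scores_b, confidence=0.95):
    """True when the confidence interval on the mean difference excludes 0."""
    _, low, high = paired_confidence_interval(scores_a, scores_b, confidence)
    return low > 0 or high < 0
```

Pairing the scores image by image removes the variation caused by differences in image difficulty, so the interval reflects only the difference between the two algorithms.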
Keywords :
document image processing; image segmentation; optical character recognition; optimisation; search problems; statistical analysis; Caere segmentation algorithms; Docstrum segmentation algorithms; Voronoi algorithms; automatic training; confidence intervals; empirical performance evaluation methodology; error analysis; optimization procedure; page segmentation algorithms; paired-model statistical analysis; performance metric; simplex algorithm; statistical analysis; Automatic testing; Character recognition; Error analysis; Image segmentation; Measurement; Optical character recognition software; Optimization methods; Performance evaluation; Statistical analysis; Training data;
fLanguage :
English
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/34.910877
Filename :
910877