DocumentCode :
2819499
Title :
Automatic extraction of non-textual information in web document and their classification
Author :
Zachariasova, Martina ; Hudec, Robert ; Benco, Miroslav ; Kamencay, Patrik
Author_Institution :
Dept. of Telecommun. & Multimedia, Univ. of Zilina, Zilina, Slovakia
fYear :
2012
fDate :
3-4 July 2012
Firstpage :
753
Lastpage :
757
Abstract :
This paper deals with research on the automatic extraction and classification of textual and non-textual information. The main idea is to create a robust method for extracting image and text segments in order to obtain a short web document. The developed method therefore consists of two types of data extraction, image and text, both of which use the Document Object Model tree. Extracted objects are saved in separate databases, followed by image analysis that defines and describes image objects from a semantic point of view. Moreover, the semantic descriptions of all modal objects are used to create the short web document. For accurate object classification, a fast and powerful hybrid segmentation algorithm based on Mean Shift and Belief Propagation principles is also discussed in this paper. The image segmentation algorithm was likewise integrated with the SIFT descriptor. Finally, SVM classification is used to obtain a semantic description of objects in a static image. The developed method was tested on both real unsegmented and segmented images.
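The DOM-based extraction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes well-formed XHTML input (real web pages usually need a tolerant HTML parser), uses Python's standard `xml.dom.minidom`, and the names `extract_segments` and `SKIPPED_TAGS` are hypothetical. Segmentation, SIFT description, and SVM classification are out of scope here.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Assumption: text inside these elements is not readable page content.
SKIPPED_TAGS = {"script", "style"}


def extract_segments(xhtml: str):
    """Walk a DOM tree and return (image_sources, text_segments)."""
    document = minidom.parseString(xhtml)
    images, texts = [], []

    def walk(node):
        if node.nodeType == Node.ELEMENT_NODE:
            tag = node.tagName.lower()
            if tag in SKIPPED_TAGS:
                return  # skip the whole subtree
            if tag == "img" and node.hasAttribute("src"):
                images.append(node.getAttribute("src"))
        elif node.nodeType == Node.TEXT_NODE:
            # Collapse whitespace and drop empty text nodes.
            text = " ".join(node.data.split())
            if text:
                texts.append(text)
        for child in node.childNodes:
            walk(child)

    walk(document.documentElement)
    return images, texts


if __name__ == "__main__":
    sample = (
        "<html><body><h1>Title</h1>"
        "<p>First <b>paragraph</b>.</p>"
        "<img src='a.png' alt='A'/>"
        "<script>var x = 1;</script>"
        "</body></html>"
    )
    image_sources, text_segments = extract_segments(sample)
    print(image_sources)
    print(text_segments)
```

The two returned lists correspond loosely to the separate image and text stores described in the abstract; in a fuller pipeline the image sources would be downloaded and passed on to the image analysis stage.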
Keywords :
feature extraction; image classification; image retrieval; image segmentation; information retrieval; support vector machines; text analysis; text detection; trees (mathematics); SIFT descriptor; SVM classification; Web document; automatic nontextual information extraction; automatic textual information extraction; belief propagation principles; databases; document object model tree; hybrid segmentation algorithm; image extraction; image segmentation algorithm; images analysis; information classification; mean shift principles; object classification; object semantic description; segmented images; static image; text data extraction; textual segment extraction; unsegmented images; Algorithm design and analysis; Data mining; Image segmentation; Semantics; Support vector machines; Testing; Training; DOM; SIFT descriptor; SVM classification; extraction images; segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Telecommunications and Signal Processing (TSP), 2012 35th International Conference on
Conference_Location :
Prague
Print_ISBN :
978-1-4673-1117-5
Type :
conf
DOI :
10.1109/TSP.2012.6256398
Filename :
6256398