• DocumentCode
    2819499
  • Title

    Automatic extraction of non-textual information in web document and their classification

  • Author

    Zachariasova, Martina ; Hudec, Robert ; Benco, Miroslav ; Kamencay, Patrik

  • Author_Institution
    Dept. of Telecommun. & Multimedia, Univ. of Zilina, Zilina, Slovakia
  • fYear
    2012
  • fDate
    3-4 July 2012
  • Firstpage
    753
  • Lastpage
    757
  • Abstract
    This paper deals with the automatic extraction of textual and non-textual information from web documents and its classification. The main idea is to create a robust method for extracting image and textual segments in order to obtain a short web document. The developed method consists of two types of data extraction, with both image and text extraction using the Document Object Model (DOM) tree. Extracted objects are saved in separate databases, followed by image analysis that defines and describes the image objects from a semantic point of view. The semantic descriptions of all modal objects are then used to create the short web document. For accurate object classification, a fast and powerful hybrid segmentation algorithm based on the Mean Shift and Belief Propagation principles is also presented. The image segmentation algorithm was integrated with the SIFT descriptor. Finally, SVM classification is used to obtain a semantic description of objects in a static image. The developed method was tested on both real unsegmented and segmented images.
  • Keywords
    feature extraction; image classification; image retrieval; image segmentation; information retrieval; support vector machines; text analysis; text detection; trees (mathematics); SIFT descriptor; SVM classification; Web document; automatic nontextual information extraction; automatic textual information extraction; believe propagation principles; databases; document object model tree; hybrid segmentation algorithm; image extraction; image segmentation algorithm; images analysis; information classification; mean shift principles; object classification; object semantic description; segmented images; static image; text data extraction; textual segment extraction; unsegmented images; Algorithm design and analysis; Data mining; Image segmentation; Semantics; Support vector machines; Testing; Training; DOM; SIFT descriptor; SVM classification; extraction images; segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Telecommunications and Signal Processing (TSP), 2012 35th International Conference on
  • Conference_Location
    Prague
  • Print_ISBN
    978-1-4673-1117-5
  • Type

    conf

  • DOI
    10.1109/TSP.2012.6256398
  • Filename
    6256398
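
The abstract describes extracting image and text segments from a web document by walking its Document Object Model tree and storing the two kinds of objects separately. The record does not give the authors' implementation, so the following is only a minimal sketch of that first step, under stated assumptions: it uses Python's standard-library `html.parser`, which is event-based and does not build a full DOM tree, and the names `SegmentExtractor` and `extract_segments` are hypothetical. Segmentation, SIFT description, and SVM classification are not covered.

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collects image sources and visible text while walking the markup.

    Illustrative only: two plain lists stand in for the separate image
    and text databases mentioned in the abstract.
    """

    SKIP = {"script", "style"}  # elements whose content is not visible text

    def __init__(self):
        super().__init__()
        self.images = []      # values of <img src="...">
        self.texts = []       # non-empty visible text fragments
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def extract_segments(html):
    """Return (image_sources, text_fragments) found in an HTML string."""
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.images, parser.texts
```

Calling `extract_segments(page_html)` on an HTML string yields one list of image sources and one list of text fragments, which a later stage could pass to image analysis and to short-document creation respectively.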