• DocumentCode
    684851
  • Title
    A new algorithm: Extracting text information from Webpage based on block and tag-function
  • Author
    Dingrong Yuan ; Xiaohu Yang ; Xue Nong ; Huiwen Fu
  • Author_Institution
    Coll. of Comput. Sci. & Inf. Technol., Guangxi Normal Univ., Guilin, China
  • fYear
    2012
  • fDate
    7-9 Dec. 2012
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    A Webpage contains much of the information that users need, but it is also filled with noise. How to remove this noise and extract the useful text information has become one of the hottest topics in the field of Web data mining. This paper proposes a text information extraction algorithm based on visual information and tag-function. The algorithm first divides a webpage into blocks, and then extracts text information from these blocks using rules derived from the characteristics of tag-function. Experiments show that the algorithm is effective and efficient.
  • Keywords
    Web sites; data mining; text analysis; Web data mining; Webpage; block-function; tag-function; text information extraction algorithm; DOM tree; information extraction; text information; visual block
  • fLanguage
    English
  • Publisher
    IET
  • Conference_Titel
    Information Science and Control Engineering 2012 (ICISCE 2012), IET International Conference on
  • Conference_Location
    Shenzhen
  • Electronic_ISBN
    978-1-84919-641-3
  • Type
    conf
  • DOI
    10.1049/cp.2012.2437
  • Filename
    6755816