• DocumentCode
    2023915
  • Title

    A Fast and Accurate Approach for Main Content Extraction Based on Character Encoding

  • Author

    Mohammadzadeh, Hadi ; Gottron, T. ; Schweiggert, Franz ; Nakhaeizadeh, Gholamreza

  • Author_Institution
    Inst. of Appl. Inf. Process., Univ. of Ulm, Ulm, Germany
  • fYear
    2011
  • fDate
    Aug. 29 2011-Sept. 2 2011
  • Firstpage
    167
  • Lastpage
    171
  • Abstract
    This paper presents a novel approach for extracting the main content from Web documents written in languages that are not based on the Latin alphabet. HTML tags are written with English (ASCII) characters, which occupy the interval [0,127] of the Unicode character set, whereas many languages, such as Arabic, use a different interval for their characters. In the first phase of our approach, we use this distinction to quickly separate the non-ASCII characters from the ASCII characters. We then determine the areas of the HTML file with a high density of non-ASCII characters and a low density of ASCII characters, and use this density to identify the areas that contain the main content. Finally, we feed those areas to our parser in order to extract the main content of the Web page. The proposed algorithm, called DANA, outperforms alternative approaches in terms of both efficiency and effectiveness, and has the potential to be extended to languages based on ASCII characters.
  • Keywords
    Internet; hypermedia markup languages; information retrieval; natural languages; Arabic language; DANA; English character set; English language; HTML tags; Web documents; Web page; character encoding; main content extraction; nonASCII; unicode character set; Electronic publishing; Encoding; Encyclopedias; HTML; Internet; Web pages; ASCII and Non-ASCII character set; HTML Documents; Information Retrieval; Main Content Extraction; UTF-8;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2011 22nd International Workshop on
  • Conference_Location
    Toulouse
  • ISSN
    1529-4188
  • Print_ISBN
    978-1-4577-0982-1
  • Type

    conf

  • DOI
    10.1109/DEXA.2011.2
  • Filename
    6059811
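
The density idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' DANA implementation, whose details are not given in this record; it only assumes that each line of the HTML source is scored by its counts of ASCII and non-ASCII characters, and that runs of consecutive lines dominated by non-ASCII characters are treated as candidate main-content areas. The function names `char_counts` and `candidate_regions` and the `min_non_ascii` threshold are hypothetical and chosen for illustration.

```python
def char_counts(line):
    """Return (ascii_count, non_ascii_count) for one line of HTML source.

    ASCII whitespace is ignored so that indentation does not inflate the
    ASCII count. Every code point >= 128 counts as non-ASCII.
    """
    ascii_n = sum(1 for ch in line if ord(ch) < 128 and not ch.isspace())
    non_ascii_n = sum(1 for ch in line if ord(ch) >= 128)
    return ascii_n, non_ascii_n


def candidate_regions(html, min_non_ascii=1):
    """Return (first_line, last_line) index pairs of consecutive lines in
    which non-ASCII characters outnumber ASCII characters.

    Assumption: "high non-ASCII density, low ASCII density" is approximated
    by a simple per-line comparison; the paper may use a different measure.
    """
    lines = html.splitlines()
    regions = []
    start = None
    for i, line in enumerate(lines):
        ascii_n, non_ascii_n = char_counts(line)
        dense = non_ascii_n >= min_non_ascii and non_ascii_n > ascii_n
        if dense and start is None:
            start = i                       # a dense run begins here
        elif not dense and start is not None:
            regions.append((start, i - 1))  # the run ended on the previous line
            start = None
    if start is not None:                   # run extends to the last line
        regions.append((start, len(lines) - 1))
    return regions


if __name__ == "__main__":
    sample = "\n".join([
        '<div class="nav"><a href="/">Home</a></div>',
        "<p>هذا هو المحتوى الرئيسي للصفحة</p>",
        "<p>ويستمر النص هنا في سطر آخر</p>",
        '<div id="footer">Copyright 2011</div>',
    ])
    print(candidate_regions(sample))  # [(1, 2)]
```

In this sketch the two Arabic paragraphs form a single candidate region, while the navigation and footer lines, which consist only of ASCII markup and text, are skipped. A real extractor would still need to parse the selected region to strip the remaining tags, as the abstract notes.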