• DocumentCode
    2452740
  • Title
    Chinese Web Text Outlier Mining Based on Domain Knowledge
  • Author
    Huosong, Xia; Zhaoyan, Fan; Liuyan, Peng
  • Author_Institution
    Dept. of Inf. Manage. & Inf. Syst., Wuhan Textile Univ., Wuhan, China
  • Volume
    2
  • fYear
    2010
  • fDate
    16-17 Dec. 2010
  • Firstpage
    73
  • Lastpage
    77
  • Abstract
    Web text mining is a growing research area in data mining. Existing Web text mining algorithms, however, concentrate on finding frequent patterns and discard the less frequent ones, which may contain outliers. In addition, domain knowledge differs in part from one industry to another, yet web texts are analyzed with the same dictionary regardless of the industry they belong to. This paper proposes formal definitions of Web text outliers and Web text outlier mining, and presents a framework for Web text outlier mining based on domain knowledge. To verify the feasibility of the framework, an algorithm for mining Chinese Web text outliers is proposed, based on an improved vector space model (VSM) and n-grams. Experimental results on an insurance topic show that the algorithm effectively finds Chinese Web text outliers in web text data, with higher precision and recall and lower complexity.
  • Keywords
    Internet; data mining; natural languages; text analysis; Chinese Web text outlier mining; VSM; Web text data; Web text mining algorithm; domain knowledge; formal definition; n-gram; Accuracy; Algorithm design and analysis; HTML; Industries; Knowledge engineering; Web pages; dissimilarity measures; insurance topic; n-grams; web text outliers
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems (GCIS), 2010 Second WRI Global Congress on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-1-4244-9247-3
  • Type
    conf
  • DOI
    10.1109/GCIS.2010.66
  • Filename
    5708790