• DocumentCode
    2758713
  • Title
    A Distributed Text Mining System for Online Web Textual Data Analysis
  • Author
    Zhou, Bin ; Jia, Yan ; Liu, Chunyang ; Zhang, Xu
  • Author_Institution
    Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2010
  • fDate
    10-12 Oct. 2010
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Real-world Web mining applications usually have diverse requirements, such as massive data processing, low system latency, and high scalability. To meet these requirements, we propose a distributed text mining system with a layered architecture that divides the system functions into three layers: the crawling and storage layer, the basic mining layer, and the analysis service layer. Message-oriented middleware is used between the layer components and services so that they communicate in a loosely coupled way. To handle the data-intensive workload and storage failures, a distributed file system is used to store and manage the raw text data and various indexes. As a case study, we also discuss the design and implementation of an experimental online topic detection application, which can be scaled to handle thousands of Internet news and forum channels and perform online analysis.
  • Keywords
    Internet; data analysis; data mining; middleware; text analysis; Internet news; Web mining; analysis service layer; crawling layer; data processing; distributed file system; distributed text mining system; forum channels; layered architecture; message-oriented middleware; mining layer; online Web textual data analysis; online analysis; online topic detection; raw text data management; raw text data storage; storage layer; system latency; Crawlers; Data analysis; Distributed databases; Internet; Text mining; distributed computing; information discovery; text mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2010 International Conference on
  • Conference_Location
    Huangshan
  • Print_ISBN
    978-1-4244-8434-8
  • Electronic_ISBN
    978-0-7695-4235-5
  • Type
    conf
  • DOI
    10.1109/CyberC.2010.11
  • Filename
    5615662