• DocumentCode
    3666564
  • Title
    Optimal stop word selection for text mining in critical infrastructure domain
  • Author
    Kasun Amarasinghe;Milos Manic;Ryan Hruska
  • Author_Institution
    Virginia Commonwealth University Richmond, Virginia, USA
  • fYear
    2015
  • fDate
    8/1/2015 12:00:00 AM
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Eliminating all stop words from the feature space is standard preprocessing practice in text mining, regardless of the domain to which it is applied. However, this may result in the loss of important information, which adversely affects the accuracy of the text mining algorithm. Therefore, this paper proposes a novel methodology for selecting the optimal set of domain-specific stop words for improved text mining accuracy. First, the presented methodology retains all the stop words in the text preprocessing phase. Then, an evolutionary technique is used to extract the optimal set of stop words that results in the best classification accuracy. The presented methodology was implemented on a corpus of open-source news articles related to critical infrastructure hazards. The first step of mining geo-dependencies among critical infrastructures from text is text classification. To achieve this, article content was classified into two classes: 1) text content with geo-location information, and 2) text content without geo-location information. The classification accuracy of the presented methodology was compared to the accuracies of four other test cases. Experimental results with 10-fold cross-validation showed that the presented method yielded an increase of 1.76% or higher in the True Positive (TP) rate and an increase of 2.27% or higher in the True Negative (TN) rate compared to the other techniques.
  • Keywords
    "Time division multiplexing","Text mining","Genetic algorithms","Standards","Accuracy","Text categorization"
  • Publisher
    IEEE
  • Conference_Titel
    Resilience Week (RWS), 2015
  • Type
    conf
  • DOI
    10.1109/RWEEK.2015.7287440
  • Filename
    7287440
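
The two-stage idea in the abstract (retain every stop word, then let an evolutionary search decide which ones to drop) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy corpus `DOCS`, the candidate list `CANDIDATES`, the naive Bayes classifier, the genetic operators, and the use of leave-one-out evaluation in place of the 10-fold cross-validation reported in the abstract.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus: label 1 = text with geo-location information, 0 = without.
DOCS = [
    ("the flood hit the plant in boise", 1),
    ("a fire was reported at the substation near reno", 1),
    ("the pipeline leak in tulsa was contained", 1),
    ("outage at the dam near salem is over", 1),
    ("the cost of the repairs was high", 0),
    ("a report of the outage is due", 0),
    ("the operator was warned of a leak", 0),
    ("the fire was a result of neglect", 0),
]
# Hypothetical candidate stop words; a chromosome marks which ones to remove.
CANDIDATES = ["the", "a", "in", "at", "near", "of", "is", "was"]


def tokens(text, removed):
    """Split on whitespace and drop the words selected for removal."""
    return [w for w in text.split() if w not in removed]


def nb_predict(train, words):
    """Multinomial naive Bayes with add-one smoothing (an assumed classifier)."""
    vocab = {w for doc, _ in train for w in doc}
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        docs = [doc for doc, y in train if y == label]
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        score = math.log(len(docs) / len(train))
        for w in words:
            score += math.log((counts[w] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def fitness(mask):
    """Leave-one-out accuracy after removing the stop words flagged in mask."""
    removed = {w for w, drop in zip(CANDIDATES, mask) if drop}
    data = [(tokens(t, removed), y) for t, y in DOCS]
    hits = 0
    for i, (words, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += nb_predict(train, words) == y
    return hits / len(data)


def evolve(generations=20, pop_size=12, mutation_rate=0.1, seed=0):
    """Simple genetic algorithm with elitism over stop-word removal masks."""
    rng = random.Random(seed)
    n = len(CANDIDATES)
    # Seed with the two baselines: remove all stop words, and remove none.
    population = [(True,) * n, (False,) * n]
    population += [tuple(rng.random() < 0.5 for _ in range(n))
                   for _ in range(pop_size - 2)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitism: top half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each gene with probability mutation_rate.
            child = tuple(g != (rng.random() < mutation_rate) for g in child)
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    best, score = evolve()
    dropped = [w for w, drop in zip(CANDIDATES, best) if drop]
    print("stop words to remove:", dropped)
    print("leave-one-out accuracy:", score)
```

Because the initial population contains both the remove-all and the keep-all masks and the top half of each generation is carried over unchanged, the returned mask never scores below either baseline on this toy data. Whether such a gain carries over to a real corpus depends on the classifier and the evaluation protocol, which this sketch does not reproduce.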