• DocumentCode
    2196197
  • Title

    A Smart Filtering System for Newly Coined Profanities by Using Approximate String Alignment

  • Author

    Yoon, Taijin ; Park, Sun-Young ; Cho, Hwan-Gue

  • Author_Institution
    Dept. of Comput. Sci., Pusan Nat. Univ., Busan, South Korea
  • fYear
    2010
  • fDate
    June 29 2010-July 1 2010
  • Firstpage
    643
  • Lastpage
    650
  • Abstract
    Verbal abuse is becoming a serious social problem in online communication because anonymity makes it easier to use profanities. A straightforward filtering method is to detect and remove words registered in a forbidden list. This is simple, but maintaining the forbidden word list is difficult because newly coined words must be added to the lexicon. In particular, Korean is an agglutinative language, so new variations of a vulgar word are easy to construct without hindering textual communication in an online environment. In this paper, we propose a new method that detects all variations of a vulgar word produced by phoneme modification, using phoneme-based string alignment. However, aligning a query word against every vulgar word registered in a database is time-consuming and computationally expensive. To overcome this cost, we propose an R*-tree-based search algorithm that exploits the metric space property of string edit distance. For the experiments, we prepared a word database with more than 9300 prototype vulgar words. For a given query word, our algorithm quickly finds the best-aligned candidate words within an edit distance of one unit (0.006 sec. with 1000 words). Our contribution is that we empirically determined the number of pivot words needed to create a near-optimal search space.
  • Keywords
    information filtering; natural language processing; R*-tree based searching algorithm; agglutinative language; approximate string alignment; online communication; profanities; smart filtering system; textual communications; verbal abuse; vulgar word; Approximation algorithms; Filtering; Games; Indexes; Measurement; Sun; approximate string matching; profanity filter; sequence alignment;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on
  • Conference_Location
    Bradford
  • Print_ISBN
    978-1-4244-7547-6
  • Type

    conf

  • DOI
    10.1109/CIT.2010.129
  • Filename
    5578129
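
The abstract's idea of using the metric space property of edit distance together with pivot words can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain character-level Levenshtein distance rather than phoneme-based alignment, and it replaces the R*-tree with a linear scan over precomputed pivot distances. The names `edit_distance`, `build_index`, and `search`, and the sample words, are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def build_index(words: Sequence[str],
                pivots: Sequence[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each word to its vector of distances to the pivot words."""
    return {w: tuple(edit_distance(w, p) for p in pivots) for w in words}


def search(query: str,
           index: Dict[str, Tuple[int, ...]],
           pivots: Sequence[str],
           radius: int = 1) -> List[Tuple[str, int]]:
    """Return (word, distance) pairs with distance <= radius.

    Assumption: pruning relies only on the triangle inequality,
    |d(q, p) - d(w, p)| <= d(q, w), so a word is skipped without a
    full alignment when any pivot shows it must lie outside the radius.
    """
    q_vec = tuple(edit_distance(query, p) for p in pivots)
    hits = []
    for word, w_vec in index.items():
        if any(abs(qd - wd) > radius for qd, wd in zip(q_vec, w_vec)):
            continue  # provably farther than radius; skip alignment
        d = edit_distance(query, word)
        if d <= radius:
            hits.append((word, d))
    return sorted(hits, key=lambda h: (h[1], h[0]))
```

In this sketch, each word becomes a point whose coordinates are its distances to the pivots, which is the kind of vector a spatial index such as an R*-tree could store. How many pivots to use is a tuning question, which the abstract says the authors settled empirically.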