DocumentCode :
2196197
Title :
A Smart Filtering System for Newly Coined Profanities by Using Approximate String Alignment
Author :
Yoon, Taijin ; Park, Sun-Young ; Cho, Hwan-Gue
Author_Institution :
Dept. of Comput. Sci., Pusan Nat. Univ., Busan, South Korea
fYear :
2010
fDate :
June 29 2010-July 1 2010
Firstpage :
643
Lastpage :
650
Abstract :
Verbal abuse is becoming a serious social problem in online communication because anonymity makes it easier to use profanities. A straightforward filtering method is to detect and remove words that are registered in a forbidden list. This is simple, but maintaining the forbidden word list is difficult because newly coined words have to be added to the lexicon. Korean in particular is an agglutinative language, so new variations of a vulgar word are easy to construct without hindering textual communication in an online environment. In this paper we propose a new method that detects all variations of a vulgar word produced by phoneme modification, by applying a phoneme-based string alignment. However, aligning a query word against every vulgar word registered in a database is time-consuming and computationally expensive. To overcome this cost, we propose an R*-tree based searching algorithm that applies the metric space property of string edit distance. For the experiments we prepared a word database with more than 9300 prototype vulgar words. For a given query word, our algorithm quickly finds the best-aligned candidate words (0.006 sec. with 1000 words), that is, those within an edit distance of one unit. Our contribution is that we empirically determined the number of pivot words needed to create a near-optimal search space.
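The pivot idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain character-level Levenshtein distance instead of the paper's phoneme-based alignment, and it scans the pivot coordinates linearly where the paper uses an R*-tree. The function names (`edit_distance`, `build_index`, `search`) and the placeholder words are hypothetical. Because edit distance is a metric, the triangle inequality gives |d(q, p) - d(w, p)| <= d(q, w) for any pivot p, so a word whose pivot coordinate differs from the query's by more than the search radius can be skipped without computing the full alignment.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute.

    Assumption: characters stand in for the phonemes used in the paper.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_index(words: Sequence[str],
                pivots: Sequence[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each word to its vector of distances to the pivot words."""
    return {w: tuple(edit_distance(w, p) for p in pivots) for w in words}


def search(query: str,
           index: Dict[str, Tuple[int, ...]],
           pivots: Sequence[str],
           radius: int = 1) -> List[Tuple[str, int]]:
    """Return (word, distance) pairs with distance <= radius.

    Hypothetical linear scan over pivot coordinates; an R*-tree would
    answer the same range query without visiting every entry.
    """
    q_coords = tuple(edit_distance(query, p) for p in pivots)
    hits = []
    for word, coords in index.items():
        # Triangle inequality lower bound: prune without aligning.
        if any(abs(qc - wc) > radius for qc, wc in zip(q_coords, coords)):
            continue
        d = edit_distance(query, word)
        if d <= radius:
            hits.append((word, d))
    return sorted(hits, key=lambda h: (h[1], h[0]))


if __name__ == "__main__":
    # Harmless placeholder lexicon; the paper's database is not reproduced.
    lexicon = ["darn", "heck", "drat", "blast"]
    pivot_words = ["darn", "blast"]
    idx = build_index(lexicon, pivot_words)
    print(search("dern", idx, pivot_words, radius=1))
```

In this sketch, more pivots prune more candidates but cost more distance computations per query, which is the trade-off behind choosing the number of pivot words.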
Keywords :
information filtering; natural language processing; R*-tree based searching algorithm; agglutinative language; approximate string alignment; online communication; profanities; smart filtering system; textual communications; verbal abuse; vulgar word; Approximation algorithms; Filtering; Games; Indexes; Measurement; Sun; approximate string matching; profanity filter; sequence alignment;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on
Conference_Location :
Bradford
Print_ISBN :
978-1-4244-7547-6
Type :
conf
DOI :
10.1109/CIT.2010.129
Filename :
5578129