Title :
Identification of deliberately doctored text documents using frequent keyword chain (FKC) model
Author :
Kaza, Siddharth ; Murthy, S. N Jayaram ; Hu, Gongzhu
Author_Institution :
Comput. Sci. Dept., Central Michigan Univ., Mount Pleasant, MI, USA
Abstract :
Text documents have always been the most dominant source of data available. A number of classification techniques are used to organize these documents, and a majority of these classification algorithms use keywords to categorize them. It is possible to mislead such algorithms by inserting keywords ('deliberate doctoring') belonging to a class different from that of the document. Such intentional deception is done in order to rank Web pages higher in searches. As text classification is used to classify e-mails, deliberate doctoring is also done as a spam filter-busting measure. In addition, it may be practiced to avoid detection by security agencies. The cost of such misclassification can be high, and it is a serious problem in many scenarios. In this paper we have exhaustively examined the possible methods to doctor a document which may lead to its misclassification. In the study we have concluded that a majority of the ways would involve insertion of a number of misleading keywords in close proximity. We propose the frequent keyword chain (FKC) model to identify such local concentrations of keywords. A tool called the FKCLocater is designed around the model; it identifies and highlights FKCs in a document and alerts the user to the possibility of misclassification. The tool is also used to specify various parameters to fine-tune the frequent keyword chain model. Experiments on newsgroup data sets show that this model is effective.
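The abstract does not define a frequent keyword chain formally, so the following is only a minimal sketch of the general idea of detecting keywords inserted in close proximity. It assumes a chain is a run of keyword hits in which consecutive hits are at most `max_gap` tokens apart and which contains at least `min_hits` hits. The function name `find_keyword_chains` and both parameters are hypothetical; they are not taken from the paper or from the FKCLocater tool.

```python
import re
from typing import List, Set, Tuple


def find_keyword_chains(
    text: str,
    keywords: Set[str],
    max_gap: int = 3,
    min_hits: int = 4,
) -> List[Tuple[int, int, List[str]]]:
    """Find local concentrations of keywords in a text.

    Assumption (not from the paper): a chain is a run of keyword hits
    where consecutive hits are at most `max_gap` tokens apart, and the
    run holds at least `min_hits` hits.

    Returns a list of (first_token_index, last_token_index, hit_words).
    """
    # Lowercase and split into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Token positions at which a keyword occurs.
    positions = [i for i, tok in enumerate(tokens) if tok in keywords]

    chains: List[Tuple[int, int, List[str]]] = []
    current: List[int] = []
    for pos in positions:
        # A gap larger than max_gap closes the current run.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_hits:
                chains.append(
                    (current[0], current[-1], [tokens[p] for p in current])
                )
            current = []
        current.append(pos)
    # Flush the final run.
    if len(current) >= min_hits:
        chains.append((current[0], current[-1], [tokens[p] for p in current]))
    return chains
```

In this sketch, the keyword set would hold terms typical of a class other than the document's own, and the two thresholds stand in for the kind of tunable parameters the abstract mentions without naming them.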
Keywords :
classification; text analysis; FKCLocater tool; Web page ranking; data formats; deliberately doctored text document identification; frequent keyword chain model; keyword insertion; misleading keywords; spam filter-busting measure; text classification; Classification algorithms; Classification tree analysis; Computer science; Costs; Electronic mail; Internet; Search engines; Security; Text categorization; Web pages;
Conference_Title :
Information Reuse and Integration, 2003. IRI 2003. IEEE International Conference on
Print_ISBN :
0-7803-8242-0
DOI :
10.1109/IRI.2003.1251443