Title :
Document categorization using semantic relatedness & Anaphora resolution: A discussion
Author :
Kaustubh Dhole;Harsh Kohli
Author_Institution :
Department of Electrical and Electronics Engineering, Department of Biological Sciences, BITS, Pilani Goa Campus
Abstract :
Document categorization is the process of assigning pre-defined categories to textual documents. State-of-the-art approaches have modelled documents in terms of corpus-length vectors and have viewed the problem only from a syntactic perspective. We develop a general measure to estimate the semantic closeness of documents by utilizing the semantic relatedness of the most discriminative individual words that define each document. Anaphora resolution is used to strengthen the meaning ascribed to each document. Our framework benefits from word semantics and the WordNet taxonomy, thus better capturing the underlying meaning of the text, and proves to be a more concise representation than traditional Information Retrieval methods. Having the same representation for individual documents and for categories of documents, together with an associated measure of semantic closeness, paves the way for modelling documents in a semantic space where unsupervised approaches can be easily used. We evaluate the performance of our measure by using it to categorize news documents into two topics, achieving 81% to 92% accuracy.
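The idea of comparing documents through the relatedness of their most discriminative words might be sketched as follows. This is a minimal illustration under stated assumptions, not the measure defined in the paper: the tiny `HYPERNYM` tree is a hypothetical stand-in for the WordNet taxonomy, `path_similarity` is a generic inverse-path-length score, the "best match, then average" aggregation in `semantic_closeness` is an assumed choice, and the discriminative words are taken as already selected. Anaphora resolution is omitted.

```python
# Toy hypernym tree (child -> parent). Hypothetical stand-in for the
# WordNet taxonomy; the real taxonomy is far larger and not a strict tree.
HYPERNYM = {
    "entity": None,
    "activity": "entity",
    "sport": "activity",
    "cricket": "sport",
    "football": "sport",
    "commerce": "activity",
    "trade": "commerce",
    "banking": "commerce",
}


def ancestors(word):
    """Return the chain from `word` up to the root, with `word` first."""
    chain = []
    while word is not None:
        chain.append(word)
        word = HYPERNYM[word]
    return chain


def path_similarity(a, b):
    """Assumed relatedness score: 1 / (1 + edges on the path from a to b)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_in_b = {w: i for i, w in enumerate(chain_b)}
    for i, w in enumerate(chain_a):
        if w in depth_in_b:  # first shared ancestor is the lowest one
            return 1.0 / (1 + i + depth_in_b[w])
    return 0.0


def directed_closeness(words_x, words_y):
    """Average, over words_x, of each word's best match in words_y."""
    best = [max(path_similarity(x, y) for y in words_y) for x in words_x]
    return sum(best) / len(best)


def semantic_closeness(words_x, words_y):
    """Symmetric closeness between two word sets (assumed aggregation)."""
    return 0.5 * (directed_closeness(words_x, words_y)
                  + directed_closeness(words_y, words_x))


def categorize(doc_words, categories):
    """Pick the category whose word set is semantically closest."""
    return max(categories,
               key=lambda name: semantic_closeness(doc_words, categories[name]))


# Documents and categories share one representation: a short list of words.
categories = {"sports": ["sport", "football"],
              "business": ["commerce", "banking"]}
label = categorize(["cricket", "football"], categories)
```

Because a document and a category are both represented as short word lists, the same `semantic_closeness` function could also compare two documents directly, which is what would make clustering or other unsupervised methods applicable in such a setup.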
Keywords :
"Semantics","Measurement","Taxonomy","Image resolution","Electronic mail","Natural language processing","Syntactics"
Conference_Titel :
Research in Computational Intelligence and Communication Networks (ICRCICN), 2015 IEEE International Conference on
DOI :
10.1109/ICRCICN.2015.7434279