DocumentCode
3761230
Title
Document categorization using semantic relatedness & Anaphora resolution: A discussion
Author
Kaustubh Dhole;Harsh Kohli
Author_Institution
Department of Electrical and Electronics Engineering, Department of Biological Sciences, BITS, Pilani Goa Campus
fYear
2015
Firstpage
439
Lastpage
443
Abstract
Document categorization is the process of assigning pre-defined categories to textual documents. State-of-the-art approaches have modelled documents as corpus-length vectors and viewed the problem only from a syntactic perspective. We develop a general measure to estimate the semantic closeness of documents by utilizing the semantic relatedness of the most discriminative individual words that define each document. Anaphora resolution is used to strengthen the meaning ascribed to each document. Our framework benefits from word semantics and the WordNet taxonomy, thus better capturing the underlying meaning of the text, and proves to be a more concise representation than traditional Information Retrieval methods. Having the same representation for a document and for a category of documents, together with an associated measure of semantic closeness, paves the way for modelling documents in a semantic space where unsupervised approaches can be easily used. We evaluate the performance of our measure by using it to categorize news documents into two topics and achieve 81% to 92% accuracy.
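The abstract does not give the formula for the closeness measure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each document and each category is already reduced to a short list of discriminative words, uses a small hypothetical taxonomy (`PARENT`) as a stand-in for WordNet hypernym links, and scores word pairs with a simple path-based relatedness. The function names and the best-match averaging are illustrative assumptions, and anaphora resolution is left out.

```python
# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet hypernym links.
PARENT = {
    "entity": None,
    "activity": "entity",
    "organization": "entity",
    "sport": "activity",
    "commerce": "activity",
    "football": "sport",
    "tennis": "sport",
    "trade": "commerce",
    "banking": "commerce",
    "team": "organization",
    "company": "organization",
}


def path_to_root(word):
    """Return the chain of nodes from `word` up to the taxonomy root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]  # raises KeyError for words outside the toy taxonomy
    return path


def relatedness(a, b):
    """Path-based score: 1 / (1 + edges between a and b via their lowest common ancestor)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    steps_in_b = {node: i for i, node in enumerate(path_b)}
    for steps_in_a, node in enumerate(path_a):
        if node in steps_in_b:
            return 1.0 / (1 + steps_in_a + steps_in_b[node])
    return 0.0


def closeness(words_a, words_b):
    """Assumed measure: symmetric average of each word's best match in the other list."""
    def directed(xs, ys):
        return sum(max(relatedness(x, y) for y in ys) for x in xs) / len(xs)
    return 0.5 * (directed(words_a, words_b) + directed(words_b, words_a))


def categorize(doc_words, categories):
    """Pick the category whose word list is semantically closest to the document."""
    return max(categories, key=lambda name: closeness(doc_words, categories[name]))


if __name__ == "__main__":
    categories = {
        "sports": ["football", "tennis", "team"],
        "business": ["trade", "banking", "company"],
    }
    print(categorize(["tennis", "team"], categories))
```

Because documents and categories share the same word-list representation here, the same `closeness` function could also compare two documents directly, which is the property the abstract highlights for unsupervised use.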
Keywords
"Semantics","Measurement","Taxonomy","Image resolution","Electronic mail","Natural language processing","Syntactics"
Publisher
ieee
Conference_Titel
Research in Computational Intelligence and Communication Networks (ICRCICN), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/ICRCICN.2015.7434279
Filename
7434279