• DocumentCode
    3761230
  • Title
    Document categorization using semantic relatedness & Anaphora resolution: A discussion
  • Author
    Kaustubh Dhole;Harsh Kohli
  • Author_Institution
    Department of Electrical and Electronics Engineering, Department of Biological Sciences, BITS, Pilani Goa Campus
  • fYear
    2015
  • Firstpage
    439
  • Lastpage
    443
  • Abstract
    Document categorization is the process of assigning pre-defined categories to textual documents. State-of-the-art approaches have modelled documents as vectors whose length equals the size of the corpus vocabulary and have viewed the problem only from a syntactic perspective. We develop a general measure to estimate the semantic closeness of documents by utilizing the semantic relatedness of the most discriminative individual words that define each document. Anaphora resolution is used to strengthen the meaning ascribed to each document. Our framework benefits from word semantics and the WordNet taxonomy, thus better capturing the underlying meaning of the text, and proves to be a more concise representation than traditional Information Retrieval methods. Using the same representation for a document and for a category of documents, together with an associated measure of semantic closeness, paves the way for modelling documents in a semantic space where unsupervised approaches can be easily used. We evaluate our measure by using it to categorize news documents into two topics, achieving 81% to 92% accuracy.
  • Keywords
    "Semantics","Measurement","Taxonomy","Image resolution","Electronic mail","Natural language processing","Syntactics"
  • Publisher
    ieee
  • Conference_Titel
    Research in Computational Intelligence and Communication Networks (ICRCICN), 2015 IEEE International Conference on
  • Type
    conf
  • DOI
    10.1109/ICRCICN.2015.7434279
  • Filename
    7434279