• DocumentCode
    2730752
  • Title

    A Knowledge Discovery Methodology for Semantic Categorization of Unstructured Textual Sources

  • Author

    Toti, D. ; Atzeni, P. ; Polticelli, F.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Roma Tre Univ., Rome, Italy
  • fYear
    2012
  • fDate
    25-29 Nov. 2012
  • Firstpage
    944
  • Lastpage
    951
  • Abstract
    We describe a methodology for identifying characterizing terms from a source text or paper and automatically building an ontology around them, with the purpose of semantically categorizing a paper corpus where documents sharing similar subjects may be subsequently clustered together by means of ontology alignment. We first employ a Natural Language Processing pipeline to extract relevant terms from the source text, and then use a combination of a pattern-based and machine-learning approach to establish semantic relationships among those terms, with some user feedback required in between. This methodology for discovering characterizing knowledge from textual sources finds its inception as an extension of PRAISED, our abbreviation discovery framework, in order to enhance its resolution capabilities. By moving from a paper-by-paper, mainly syntactical process to a corpus-based, semantic approach, it was in fact possible to overcome earlier limits of the system related to abbreviations whose explanation could not be found within the same paper they were cited in. At the same time, though, the methodology we present is not tied to this specific task, but is instead of relevance for a variety of contexts, and might therefore be used to build a stand-alone system for advanced knowledge extraction and semantic categorization.
  • Keywords
    learning (artificial intelligence); natural language processing; ontologies (artificial intelligence); pattern clustering; text analysis; PRAISED framework; characterized term identification; document clustering; knowledge discovery methodology; knowledge extraction; machine-learning approach; natural language processing pipeline; ontology alignment; paper corpus semantic categorization; pattern-based approach; resolution capability enhancement; semantic relationships; syntactical process; unstructured textual source semantic categorization; user feedback; Context; Natural language processing; Ontologies; Pipelines; Semantics; Tagging; Training;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Image Technology and Internet Based Systems (SITIS), 2012 Eighth International Conference on
  • Conference_Location
    Naples
  • Print_ISBN
    978-1-4673-5152-2
  • Type

    conf

  • DOI
    10.1109/SITIS.2012.140
  • Filename
    6395193