• DocumentCode
    2334433
  • Title
    The DIAsDEM framework for converting domain-specific texts into XML documents with data mining techniques
  • Author
    Graubitz, Henner ; Spiliopoulou, Myra ; Winkler, Karsten
  • Author_Institution
    Dept. of E-Business, Leipzig Graduate Sch. of Manage., Germany
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    171
  • Lastpage
    178
  • Abstract
    Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and queryable. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to them, and deriving a flat XML DTD for the archive. DIAsDEM focuses on archives characterized by a peculiar terminology and by an implicit structure, such as court filings and company reports. In the knowledge discovery phase, text units are iteratively clustered by similarity of their content. Each iteration outputs clusters satisfying a set of quality criteria. Text units contained in these clusters are tagged with semiautomatically determined cluster labels, which serve as their XML tags. Additionally, extracted named entities (e.g., persons) serve as attributes of XML tags. We apply the framework in a case study on the German Commercial Register.
  • Keywords
    data mining; data warehouses; hypermedia markup languages; DIAsDEM framework; German Commercial Register; XML documents; archive; company reports; content similarity; court filings; domain-specific text conversion; flat XML DTD; iterative clustering; knowledge discovery; markup languages; quality criteria; semantic tagging; semiautomatically determined cluster labels; structural text units; terminology; Data mining; Knowledge management; Markup languages; Project management; Relational databases; Tagging; Terminology; Text mining; Vocabulary; XML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    0-7695-1119-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2001.989515
  • Filename
    989515
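
The tagging step described in the abstract (text units tagged with cluster labels, named entities stored as tag attributes, and a flat DTD derived for the archive) can be illustrated with a minimal sketch. It is not the DIAsDEM implementation: the clustering step is skipped, and the sample sentences, the labels `Appointment` and `ShareCapital`, the root name `Document`, and the helpers `tag_document` and `flat_dtd` are all hypothetical. "Flat" is assumed here to mean that semantic tags are not nested inside each other.

```python
import xml.etree.ElementTree as ET

# Hypothetical result of the clustering step: (text unit, cluster label, named entities).
# A label of None marks a unit whose cluster did not meet the quality criteria.
units = [
    ("Anna Schmidt was appointed managing director.", "Appointment",
     {"Person": "Anna Schmidt"}),
    ("The share capital amounts to 50,000 DEM.", "ShareCapital",
     {"Amount": "50,000 DEM"}),
    ("Further details follow.", None, {}),
]


def tag_document(units, root_name="Document"):
    """Wrap each labeled text unit in an element named after its cluster label."""
    root = ET.Element(root_name)
    last = None
    for text, label, entities in units:
        if label is None:
            # Unlabeled units stay in the document as plain character data.
            if last is None:
                root.text = (root.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            # Named entities become attributes of the semantic tag.
            last = ET.SubElement(root, label, entities)
            last.text = text
    return root


def flat_dtd(units, root_name="Document"):
    """Derive a DTD in which every tag holds only text (no nested tags)."""
    attrs = {}
    for _, label, entities in units:
        if label is not None:
            attrs.setdefault(label, set()).update(entities)
    labels = sorted(attrs)
    if labels:
        lines = [f"<!ELEMENT {root_name} (#PCDATA | {' | '.join(labels)})*>"]
    else:
        lines = [f"<!ELEMENT {root_name} (#PCDATA)>"]
    for label in labels:
        lines.append(f"<!ELEMENT {label} (#PCDATA)>")
        for name in sorted(attrs[label]):
            lines.append(f"<!ATTLIST {label} {name} CDATA #IMPLIED>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(ET.tostring(tag_document(units), encoding="unicode"))
    print(flat_dtd(units))
```

Declaring the root element with mixed content lets untagged text units remain in the document next to the tagged ones, which is one possible reading of how units outside the accepted clusters might be handled.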