• DocumentCode
    3204889
  • Title
    A method of describing document contents through topic selection
  • Author
    Gelbukh, A. ; Sidorov, G. ; Guzmán-Arenas, A.
  • Author_Institution
    Center for Comput. Res., Nat. Polytech. Inst., Mexico City, Mexico
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    73
  • Lastpage
    80
  • Abstract
    Given a large hierarchical dictionary of concepts, the task of selecting the concepts that describe the contents of a given document is considered. The main difficulty lies in the proper handling of the top-level concepts in the hierarchy. The document is represented by a histogram of topics with their respective contributions to the document. Each contribution is determined by comparing the document with the “ideal” document for the corresponding topic in the dictionary. The “ideal” document for a concept is one that contains only the keywords belonging to this concept, in proportion to their occurrences in the training corpus. A fast comparison algorithm for some types of metrics is proposed. The application of the method in a system classifier is discussed.
  • Keywords
    classification; dictionaries; document handling; learning (artificial intelligence); concept dictionary; document content description; fast algorithm; histogram; ideal document; keywords; large hierarchical dictionary; system classifier; top-level concepts; topic selection; training corpus; Cities and towns; Data mining; Dictionaries; Histograms; Identity-based encryption; Internet; Laboratories; Natural languages; Nominations and elections; Text mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    String Processing and Information Retrieval Symposium, 1999 and International Workshop on Groupware
  • Conference_Location
    Cancun
  • Print_ISBN
    0-7695-0268-7
  • Type
    conf
  • DOI
    10.1109/SPIRE.1999.796580
  • Filename
    796580
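
The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not name a metric, so cosine similarity is assumed here, and the hierarchical dictionary is flattened into a hypothetical mapping from concept name to keyword list. The function names (`ideal_document`, `cosine`, `topic_histogram`) are illustrative only, and the fast comparison algorithm mentioned in the abstract is not reproduced.

```python
from collections import Counter
from math import sqrt


def ideal_document(concept_keywords, corpus_counts):
    """Build the 'ideal' document for one concept: only the concept's
    keywords, weighted by their counts in the training corpus."""
    ideal = {}
    for word in concept_keywords:
        count = corpus_counts.get(word, 0)
        if count > 0:
            ideal[word] = count
    return ideal


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors.
    ASSUMPTION: the abstract does not specify the metric."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_histogram(doc_tokens, dictionary, corpus_counts):
    """Score the document against every concept's ideal document.
    ASSUMPTION: `dictionary` is a flat {concept: [keywords]} mapping."""
    doc = Counter(doc_tokens)
    return {
        concept: cosine(doc, ideal_document(keywords, corpus_counts))
        for concept, keywords in dictionary.items()
    }
```

Calling `topic_histogram(tokens, dictionary, corpus_counts)` returns one score per concept. Under these assumptions, a document whose word counts match a concept's ideal document scores close to 1 for that concept, and a document sharing no keywords with a concept scores 0.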