  • DocumentCode
    419085
  • Title
    Evolving document features for Web document clustering: a feasibility study
  • Author
    Sinka, Mark P.; Corne, David W.
  • Author_Institution
    Dept. of Comput. Sci., Reading Univ., UK
  • Volume
    1
  • fYear
    2004
  • fDate
    19-23 June 2004
  • Firstpage
    891
  • Abstract
    Document analysis and its associated research underpin Web intelligence and the envisaged 'semantic Web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore the question of alternative encodings, and we search for such encodings using an evolutionary algorithm (EA). These alternatives consider a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these. Tests on the BankSearch dataset found encodings that outperformed previous results using TF-based encodings. Among several tentative findings, it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks.
  • Keywords
    Internet; evolutionary computation; pattern clustering; text analysis; BankSearch dataset; TF-based encodings; Web document clustering; Web intelligence; World Wide Web; document analysis; document encoding; document features; evolutionary algorithm; fixed-length vectors; semantic Web; term frequency; Computer science; Encoding; Frequency; Information retrieval; Search engines; Space exploration; Taxonomy; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Evolutionary Computation, 2004. CEC2004. Congress on
  • Print_ISBN
    0-7803-8515-2
  • Type
    conf
  • DOI
    10.1109/CEC.2004.1330955
  • Filename
    1330955