Evolving document features for Web document clustering: a feasibility study

Author

Sinka, Mark P. ; Corne, David W.

Author_Institution

Dept. of Comput. Sci., Reading Univ., UK

Volume

1

fYear

2004

fDate

19-23 June 2004

Firstpage

891

Abstract

Document analysis and its associated research underpin Web intelligence and the envisaged 'semantic Web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore the question of alternative encodings, and we search for such encodings using an evolutionary algorithm (EA). These alternatives consider a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these features. In tests on the BankSearch dataset, the EA found encodings that outperformed previous results obtained with TF-based encodings. Among several tentative findings, it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks.
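The idea of an EA searching over weighted combinations of document features can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the three numeric features per document, the toy data, the separation-ratio fitness (the paper's actual clustering method and quality measure are not given in this record), and the truncation-selection EA with Gaussian mutation.

```python
import random

# Hypothetical toy data: each document is a list of three extracted features.
# Feature 0 separates the two classes; features 1 and 2 behave like noise.
DOCS = [[5, 1, 9], [6, 8, 2], [1, 2, 8], [0, 9, 1]]
LABELS = [0, 0, 1, 1]

def encode(doc, weights):
    """Scale each extracted feature by its weight."""
    return [w * x for w, x in zip(weights, doc)]

def sq_dist(a, b):
    """Squared Euclidean distance between two encoded documents."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fitness(weights, docs, labels):
    """Assumed proxy for clustering quality:
    mean between-class distance divided by mean within-class distance."""
    vecs = [encode(d, weights) for d in docs]
    within, between = [], []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            d = sq_dist(vecs[i], vecs[j])
            (within if labels[i] == labels[j] else between).append(d)
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_between / (mean_within + 1e-9)

def mutate(weights, rng, sigma=0.2):
    """Gaussian perturbation of every weight, clipped at zero."""
    return [max(0.0, w + rng.gauss(0.0, sigma)) for w in weights]

def evolve(docs, labels, generations=50, pop_size=20, seed=0):
    """Truncation-selection EA over feature-weight vectors (illustrative only)."""
    rng = random.Random(seed)
    n = len(docs[0])
    # Seed the population with uniform weights plus random weight vectors.
    population = [[1.0] * n]
    population += [[rng.random() for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, docs, labels), reverse=True)
        parents = population[: pop_size // 2]  # the best half survives unchanged
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda w: fitness(w, docs, labels))

if __name__ == "__main__":
    best = evolve(DOCS, LABELS)
    print("evolved weights:", best)
    print("fitness:", fitness(best, DOCS, LABELS))
```

Because the best half of each generation is carried over unchanged, the best fitness never decreases, so the evolved weights score at least as well as the uniform weighting that seeds the population.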

Keywords

Internet; evolutionary computation; pattern clustering; text analysis; BankSearch dataset; TF-based encodings; Web document clustering; Web intelligence; World Wide Web; document analysis; document encoding; document features; evolutionary algorithm; fixed-length vectors; semantic Web; term frequency; Computer science; Encoding; Frequency; Information retrieval; Search engines; Semantic Web; Space exploration; Taxonomy; Web sites

fLanguage

English

Publisher

IEEE

Conference_Titel

Evolutionary Computation, 2004. CEC2004. Congress on

Print_ISBN

0-7803-8515-2

Type

conf

DOI

10.1109/CEC.2004.1330955

Filename

1330955