Title :
Evolving document features for Web document clustering: a feasibility study
Author :
Sinka, Mark P. ; Corne, David W.
Author_Institution :
Dept. of Comput. Sci., Reading Univ., UK
Abstract :
Document analysis and its associated research underpin Web intelligence and the envisaged 'semantic Web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore the question of alternative encodings, and we search for such encodings using an evolutionary algorithm (EA). These alternatives consider a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these. In tests on the BankSearch dataset, the EA found encodings that outperformed previous results obtained with TF-based encodings. Among several tentative findings, it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks.
Keywords :
Internet; evolutionary computation; pattern clustering; text analysis; BankSearch dataset; TF-based encodings; Web document clustering; Web intelligence; World Wide Web; document analysis; document encoding; document features; evolutionary algorithm; fixed-length vectors; semantic Web; term frequency; Computer science; Encoding; Frequency; Information retrieval; Search engines; Space exploration; Taxonomy; Web sites
Conference_Title :
Evolutionary Computation, 2004. CEC2004. Congress on
Print_ISBN :
0-7803-8515-2
DOI :
10.1109/CEC.2004.1330955