DocumentCode
419085
Title
Evolving document features for Web document clustering: a feasibility study
Author
Sinka, Mark P. ; Corne, David W.
Author_Institution
Dept. of Comput. Sci., Reading Univ., UK
Volume
1
fYear
2004
fDate
19-23 June 2004
Firstpage
891
Abstract
Document analysis and its associated research underpin Web intelligence and the envisaged 'semantic Web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore the question of alternative encodings, and we search for such encodings using an evolutionary algorithm (EA). These alternatives consider a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these. Tests on the BankSearch dataset found encodings that outperformed previous results obtained with TF-based encodings. Among several tentative findings, it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks.
Keywords
Internet; evolutionary computation; pattern clustering; text analysis; BankSearch dataset; TF-based encodings; Web document clustering; Web intelligence; World Wide Web; document analysis; document encoding; document features; evolutionary algorithm; fixed-length vectors; semantic Web; term frequency; Computer science; Encoding; Frequency; Information retrieval; Search engines; Space exploration; Taxonomy; Web sites
fLanguage
English
Publisher
IEEE
Conference_Titel
Evolutionary Computation, 2004. CEC2004. Congress on
Print_ISBN
0-7803-8515-2
Type
conf
DOI
10.1109/CEC.2004.1330955
Filename
1330955