Title :
Web content management by self-organization
Author :
Freeman, Richard T. ; Yin, Hujun
Author_Institution :
Sch. of Electr. & Electron. Eng., Univ. of Manchester, UK
Abstract :
We present a new method for content management and knowledge discovery using a topology-preserving neural network. The method, termed topological organization of content (TOC), can generate a taxonomy of topics from a set of unannotated, unstructured documents. The TOC consists of a hierarchy of self-organizing growing chains (GCs), each of which can develop independently in terms of size and topics. The dynamic development process is validated continuously using a proposed entropy-based Bayesian information criterion (BIC). Each chain meeting the criterion spawns child chains, with reduced vocabularies and increased specializations. This results in a topological tree hierarchy, which can be browsed like a table of contents directory or web portal. A brief review is given of existing methods for document clustering and organization, and of clustering validation measures. The proposed approach has been tested and compared with several existing methods on real-world web page datasets. The results have clearly demonstrated the advantages and efficiency of the proposed method in content organization, in terms of computational cost and representation. The TOC can be easily adapted for large-scale applications. The topology provides a unique, additional feature for retrieving related topics and confining the search space.
Keywords :
Internet; classification; content management; data mining; document handling; entropy; information retrieval; pattern clustering; self-organising feature maps; Web content management; clustering validation; document categorization; document clustering; document organization; dynamic development process; entropy-based Bayesian information criterion; knowledge discovery; self-organizing growing chains; self-organizing maps; topic taxonomy; topological content organization; topological tree hierarchy; topology-preserving neural network; unannotated unstructured documents; Bayesian methods; Computational efficiency; Large-scale systems; Neural networks; Portals; Taxonomy; Testing; Vocabulary; Web pages; Bayesian information criterion (BIC); hierarchical clustering; information retrieval (IR); self-organizing maps (SOMs); taxonomy generation; topic hierarchy; topological tree structure; Abstracting and Indexing as Topic; Artificial Intelligence; Database Management Systems; Documentation; Information Storage and Retrieval; Natural Language Processing; Pattern Recognition, Automated; Signal Processing, Computer-Assisted
Journal_Title :
Neural Networks, IEEE Transactions on
DOI :
10.1109/TNN.2005.853415