DocumentCode
3194371
Title
Document Clustering Method Based on Visual Features
Author
Liu, Yucong ; Zhang, Bofeng ; Xing, Kun ; Zhou, Bo
Author_Institution
Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China
fYear
2011
fDate
19-22 Oct. 2011
Firstpage
458
Lastpage
462
Abstract
Two problems are central to personalized information services based on a user model. One is how to obtain and describe a user's personal information, i.e. building the user model; the other is how to organize information resources, i.e. document clustering. Without a proper clustering algorithm, it is difficult to find the desired information. Several new ideas have been proposed in recent years, but most of them consider only the text content. Other information, such as text size, font and other appearance characteristics (so-called visual features), may contribute more to document clustering. This paper proposes a method for clustering scientific documents based on visual features, called the VF-Clustering algorithm. Five kinds of visual features of documents are defined: body, abstract, subtitle, keyword and title. The ideas of crossover and mutation from genetic algorithms are used to dynamically adjust the value of k and the cluster centers in the k-means algorithm. Experimental results support the proposed approach. Among the five visual features, the clustering accuracy and stability of subtitle are second only to those of body, while its efficiency is much better because the subtitle text is much smaller than the body text. Clustering on the combination of subtitle and keyword is more accurate than clustering on either alone, and only slightly less accurate than clustering on the combination of subtitle, keyword and body. If efficiency is an essential factor, clustering on the combination of subtitle and keyword can be the best choice.
Keywords
genetic algorithms; information resources; pattern clustering; text analysis; abstract; appearance characteristics; body; crossover; document clustering method; genetic algorithm; k-means algorithm; keyword; mutation; personalized information services; scientific documents; subtitle; text font; text information; text size; title; user personal information; visual features; Algorithm design and analysis; Clustering algorithms; Feature extraction; Genetic algorithms; Heuristic algorithms; Vectors; Visualization; document clustering; k-means
fLanguage
English
Publisher
IEEE
Conference_Titel
Internet of Things (iThings/CPSCom), 2011 International Conference on and 4th International Conference on Cyber, Physical and Social Computing
Conference_Location
Dalian
Print_ISBN
978-1-4577-1976-9
Type
conf
DOI
10.1109/iThings/CPSCom.2011.69
Filename
6142293