Title :
Real-time unsupervised classification of web documents
Author :
Sigogne, Anthony ; Constant, Matthieu
Author_Institution :
Lab. d'Inf. Gaspard-Monge, Univ. Paris-Est, Marne-la-Vallee, France
Abstract :
This paper addresses the problem of clustering dynamic collections of web documents. We present an iterative algorithm based on fine-grained keyword extraction (simple words, compound words, and proper nouns). Each new document inserted into the collection is either assigned to an existing class containing documents on the same topic or placed in a new class. After each insertion, classes are refined where necessary using statistical techniques. An implementation of this algorithm was successfully integrated into an application used for information intelligence.
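The insertion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the threshold, or the data structures, so everything here is an assumption made for illustration: the names `TopicClass`, `jaccard`, and `assign` are hypothetical, Jaccard overlap between keyword sets stands in for whatever measure the authors use, and the value `0.3` is arbitrary. Keyword extraction and the statistical refinement step are not modeled; keywords are assumed to be supplied already extracted.

```python
from dataclasses import dataclass, field


@dataclass
class TopicClass:
    """Hypothetical class (cluster) of documents with pooled keyword counts."""
    doc_ids: list = field(default_factory=list)
    keyword_counts: dict = field(default_factory=dict)

    def add(self, doc_id, keywords):
        # Record the document and count each of its keywords once.
        self.doc_ids.append(doc_id)
        for kw in keywords:
            self.keyword_counts[kw] = self.keyword_counts.get(kw, 0) + 1


def jaccard(a, b):
    """Jaccard overlap of two keyword sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign(classes, doc_id, keywords, threshold=0.3):
    """Insert one document: join the most similar class, or open a new one.

    The threshold is an arbitrary illustrative value, not taken from the paper.
    Returns the index of the class the document was placed in.
    """
    keywords = set(keywords)
    best_index, best_score = -1, 0.0
    for i, cls in enumerate(classes):
        score = jaccard(keywords, set(cls.keyword_counts))
        if score > best_score:
            best_index, best_score = i, score
    # No class is similar enough: create a new one for this document.
    if best_index == -1 or best_score < threshold:
        classes.append(TopicClass())
        best_index = len(classes) - 1
    classes[best_index].add(doc_id, keywords)
    return best_index


if __name__ == "__main__":
    classes = []
    for doc_id, kws in [
        ("d1", {"election", "senate", "vote"}),
        ("d2", {"election", "vote", "ballot"}),
        ("d3", {"football", "league", "goal"}),
    ]:
        print(doc_id, "-> class", assign(classes, doc_id, kws))
```

In this sketch each document is processed once, in arrival order, which is what makes the scheme suitable for a collection that grows over time. A refinement pass of the kind the abstract mentions would run after `assign`, but its details are not given here and are left out.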
Keywords :
Internet; algorithm theory; document handling; pattern classification; real-time systems; Web documents; algorithm implementation; class containing documents; dynamic collections web documents; fine grained keyword extraction; information Intelligence; iterative algorithm based; real time classification; statistical techniques; Classification algorithms; Clustering algorithms; Computer science; Data mining; Frequency; Information technology; Iterative algorithms; Large-scale systems; Support vector machines;
Conference_Title :
Computer Science and Information Technology, 2009. IMCSIT '09. International Multiconference on
Conference_Location :
Mragowo
Print_ISBN :
978-1-4244-5314-6
DOI :
10.1109/IMCSIT.2009.5352714