DocumentCode :
26396
Title :
On the Use of Side Information for Mining Text Data
Author :
Aggarwal, Charu C. ; Yuchen Zhao ; Yu, Philip S.
Author_Institution :
Dept. of Comput. Sci., IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
Volume :
26
Issue :
6
fYear :
2014
fDate :
June 2014
Firstpage :
1415
Lastpage :
1429
Abstract :
In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process, or can add noise to the process. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
Keywords :
data mining; pattern clustering; probability; text analysis; Web logs; classical partitioning algorithms; clustering approach; document provenance information; nontextual attributes; probabilistic models; real data sets; side information; text data mining; text documents; user-access behavior; Approximation methods; Clustering algorithms; Coherence; Database systems; Noise measurement; Partitioning algorithms; Probabilistic logic; Data mining; clustering; text mining;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2012.148
Filename :
6247433