Title :
A parallel Probabilistic Latent Semantic Analysis method on MapReduce platform
Author :
Zhao Liang ; Wenye Li ; Yuxi Li
Author_Institution :
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Abstract :
Probabilistic Latent Semantic Analysis (PLSA) is a powerful statistical technique for analyzing relations in co-occurrence data and is widely used in automated information processing tasks. However, it involves non-trivial computation and is often difficult and time-consuming to train on large datasets. MapReduce is a computing framework designed by Google that aims to provide a practical distributed solution for large-scale data analysis tasks using clusters of computers. In this work, we address the scalability problem of PLSA by proposing and implementing a parallel method to train PLSA under the MapReduce computing framework. Empirical results show that, when the training dataset is large, learning the probability distributions of the PLSA model in parallel achieves almost linear speedups and thus provides a practical solution for large-scale data analysis applications.
Keywords :
data analysis; learning (artificial intelligence); parallel processing; statistical distributions; workstation clusters; MapReduce computing framework; MapReduce platform; PLSA; automated information processing tasks; co-occurrence data relation analysis; computer cluster; large-scale data analysis task; learning; parallel probabilistic latent semantic analysis method; probability distributions; scalability problem; statistical technique; training dataset; Algorithm design and analysis; Computational modeling; Computers; Probabilistic logic; Scalability; Semantics; Training; EM algorithm; MapReduce; Parallelism; Probabilistic Latent Semantic Analysis;
Conference_Title :
2013 IEEE International Conference on Information and Automation (ICIA)
Conference_Location :
Yinchuan
DOI :
10.1109/ICInfA.2013.6720444