• DocumentCode
    188222
  • Title

    Big Data Processing with Probabilistic Latent Semantic Analysis on MapReduce

  • Author

    Yong Zhao ; Yao Chen ; Zhao Liang ; Shuangshuang Yuan ; Youfu Li

  • Author_Institution
    Sch. of Comput. Sci. & Eng., Univ. of Electron. Sci. & Technol. of China, Chengdu, China
  • fYear
    2014
  • fDate
    13-15 Oct. 2014
  • Firstpage
    162
  • Lastpage
    166
  • Abstract
    Probabilistic Latent Semantic Analysis (PLSA) is a powerful statistical technique for analyzing co-occurrence data. It is widely used in information processing, including information retrieval, information filtering, text processing automation, natural language processing, and related areas. However, training a PLSA model on big data has very high time and space complexity. Researchers have tried to solve this problem with parallel approaches, but these only partially reduce the time complexity: main memory still has to load a large amount of data during computation. To solve this data scalability problem, a parallel method for training PLSA is proposed that adapts the traditional EM algorithm to MapReduce, a computing framework for processing vast amounts of data in parallel on clusters. In this way, the main memory of each computer needs to load only part of the dataset. This method can reduce time and space complexity simultaneously. Results show that this method can deal with large datasets efficiently.
  • Keywords
    Big Data; computational complexity; expectation-maximisation algorithm; parallel programming; probability; Big Data processing; EM algorithm; MapReduce; PLSA model training; co-occurrence data analysis; computing framework; data scalability problem; information filtering; information processing; information retrieval; main memory; natural language processing; parallel method; probabilistic latent semantic analysis; space complexity; statistical technique; text processing automation; time complexity; Computational modeling; Information retrieval; Load modeling; Mathematical model; Probabilistic logic; Semantics; Training; MapReduce; Parallelism; Probabilistic Latent Semantic Analysis; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2014 International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4799-6235-8
  • Type

    conf

  • DOI
    10.1109/CyberC.2014.37
  • Filename
    6984300