DocumentCode :
243685
Title :
A Comparison of Approaches to Chinese Word Segmentation in Hadoop
Author :
Zhangang Wang ; Bangjie Meng
Author_Institution :
Sch. of Comput. Sci. & Software, Tianjin Polytech. Univ., Tianjin, China
fYear :
2014
fDate :
14 Dec. 2014
Firstpage :
844
Lastpage :
850
Abstract :
Today, we're surrounded by data, especially Chinese-language information. The exponential growth of data first presented challenges to cutting-edge businesses such as Alibaba, Jingdong, Amazon, and Microsoft, which need to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Chinese word segmentation is a fundamental problem in Chinese information processing, and the segmentation algorithm is one of its core components: unlike English, Chinese text provides no delimiters between words, so word boundaries must be resolved before any further processing. Chinese lexical analysis is thus the foundation and key to Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) are two very popular Chinese word segmentation algorithms, and both play an important role in processing Chinese text data. If the two algorithms are applied well to a Hadoop distributed environment, they can achieve better performance. In this paper we compare the performance of the IK and IC algorithms through theory and experiments. The paper reports experimental work on the mass Chinese text segmentation problem and its optimal solution on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework for parallel processing of large data sets. We have built a prototype implementation of a Hadoop cluster, HDFS storage, and the MapReduce framework for processing large text data sets, modeled on typical big data application scenarios. The results obtained from various experiments indicate that both the IC and IK algorithms address the mass Chinese text segmentation problem favorably, tackling a big data problem using Hadoop and MapReduce. Furthermore, we evaluate both kinds of segmentation in terms of performance. Although loading data into, and tuning the execution of, the parallel distributed system took much longer than for the centralized system, the observed performance of these word segmentation algorithms was strikingly better.
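As a concrete illustration of the approach the abstract describes, the sketch below shows how a Chinese segmenter such as IKAnalyzer might be wrapped in a Hadoop MapReduce job that counts token frequencies over a corpus stored on HDFS. This is a minimal sketch, not the authors' implementation: the job structure follows the standard Hadoop word-count pattern, the IKSegmenter usage assumes a typical IKAnalyzer release, and the class name and paths are illustrative.

```java
// Minimal sketch (not the paper's code): Chinese token frequency count on Hadoop,
// with IKAnalyzer performing segmentation inside the map phase. Assumes an
// IKAnalyzer release exposing org.wltea.analyzer.core.IKSegmenter.
import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class ChineseTokenCount {

    // Map phase: segment each input line into Chinese tokens and emit (token, 1).
    public static class SegmentMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text token = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // "true" selects IK's smart (coarse-grained) segmentation mode.
            IKSegmenter segmenter =
                    new IKSegmenter(new StringReader(value.toString()), true);
            Lexeme lexeme;
            while ((lexeme = segmenter.next()) != null) {
                token.set(lexeme.getLexemeText());
                context.write(token, ONE);
            }
        }
    }

    // Reduce phase: sum the partial counts for each token.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chinese token count");
        job.setJarByClass(ChineseTokenCount.class);
        job.setMapperClass(SegmentMapper.class);
        job.setCombinerClass(SumReducer.class); // combiner cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Swapping ICTCLAS in for IKAnalyzer would only change the mapper's tokenization call; a second MapReduce pass over the output could then rank tokens by frequency in inverted descending order, as the index terms suggest.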
Keywords :
Big Data; Web sites; distributed databases; natural language processing; parallel processing; text analysis; word processing; Big Data problem; Chinese information processing; Chinese lexical analysis; Chinese word segmentation algorithm; English; HDFS storage; Hadoop cluster; Hadoop distributed environment; Hadoop distributed file system; IC algorithm; ICTCLAS; IK algorithm; IKAnalyzer; Map Reduce framework; Map Reduce programming framework; big data application scenario; centralized system; exponential growth; mass Chinese text segmentation problem; optimal solution; parallel distributed system; text data; Accuracy; Algorithm design and analysis; Computers; Hidden Markov models; Integrated circuits; Search engines; Sorting; Chinese word segmentation; HDFS; Hadoop; ICTCLAS; IKAnalyzer; MapReduce; inverted descending order;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining Workshop (ICDMW), 2014 IEEE International Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-1-4799-4275-6
Type :
conf
DOI :
10.1109/ICDMW.2014.43
Filename :
7022683