DocumentCode :
243685
Title :
A Comparison of Approaches to Chinese Word Segmentation in Hadoop
Author :
Zhangang Wang ; Bangjie Meng
Author_Institution :
Sch. of Comput. Sci. & Software, Tianjin Polytech. Univ., Tianjin, China
fYear :
2014
fDate :
14 Dec. 2014
Firstpage :
844
Lastpage :
850
Abstract :
Today, we're surrounded by data, especially Chinese-language information. The exponential growth of data first presented challenges to cutting-edge businesses such as Alibaba, Jingdong, Amazon, and Microsoft, which need to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Chinese word segmentation is a fundamental problem in Chinese information processing, and the segmentation algorithm is one of its core components: unlike English, Chinese text provides no delimiters between words, so word boundaries must be resolved before any further processing. Chinese lexical analysis is thus the foundation and key to Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) are two very popular Chinese word segmentation algorithms, and both play an important role in processing Chinese text data. If the two algorithms are applied well to a Hadoop distributed environment, they can achieve better performance. In this paper we compare the performance of the IK and IC algorithms through theory and experiments. The paper reports experimental work on the mass Chinese text segmentation problem and its optimal solution on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework for parallel processing of large data sets. We have built a prototype implementation of a Hadoop cluster, HDFS storage, and the MapReduce framework for processing large text data sets, modeled on typical big data application scenarios. The results obtained from various experiments indicate that both the IC and IK algorithms address the mass Chinese text segmentation problem favorably, tackling a big data problem using Hadoop and MapReduce. Furthermore, we evaluate both kinds of segmentation in terms of performance. Although loading data into, and tuning the execution of, the parallel distributed system took much longer than for the centralized system, the observed performance of these word segmentation algorithms was strikingly better.
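As a concrete illustration of the approach the abstract describes, the sketch below shows how a Chinese segmenter such as IKAnalyzer might be wrapped in a Hadoop MapReduce job that counts token frequencies over a corpus stored on HDFS. This is a minimal sketch, not the authors' implementation: the job structure follows the standard Hadoop word-count pattern, the IKSegmenter usage assumes a typical IKAnalyzer release, and the class name and paths are illustrative.

```java
// Minimal sketch (not the paper's code): Chinese token frequency count on Hadoop,
// with IKAnalyzer performing segmentation inside the map phase. Assumes an
// IKAnalyzer release exposing org.wltea.analyzer.core.IKSegmenter.
import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class ChineseTokenCount {

    // Map phase: segment each input line into Chinese tokens and emit (token, 1).
    public static class SegmentMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text token = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // "true" selects IK's smart (coarse-grained) segmentation mode.
            IKSegmenter segmenter =
                    new IKSegmenter(new StringReader(value.toString()), true);
            Lexeme lexeme;
            while ((lexeme = segmenter.next()) != null) {
                token.set(lexeme.getLexemeText());
                context.write(token, ONE);
            }
        }
    }

    // Reduce phase: sum the partial counts for each token.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chinese token count");
        job.setJarByClass(ChineseTokenCount.class);
        job.setMapperClass(SegmentMapper.class);
        job.setCombinerClass(SumReducer.class); // combiner cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Swapping ICTCLAS in for IKAnalyzer would only change the mapper's tokenization call; a second MapReduce pass over the output could then rank tokens by frequency in inverted descending order, as the index terms suggest.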
Keywords :
Big Data; Web sites; distributed databases; natural language processing; parallel processing; text analysis; word processing; Big Data problem; Chinese information processing; Chinese lexical analysis; Chinese word segmentation algorithm; English; HDFS storage; Hadoop cluster; Hadoop distributed environment; Hadoop distributed file system; IC algorithm; ICTCLAS; IK algorithm; IKAnalyzer; Map Reduce framework; Map Reduce programming framework; big data application scenario; centralized system; exponential growth; mass Chinese text segmentation problem; optimal solution; parallel distributed system; text data; Accuracy; Algorithm design and analysis; Computers; Hidden Markov models; Integrated circuits; Search engines; Sorting; Chinese word segmentation; HDFS; Hadoop; ICTCLAS; IKAnalyzer; MapReduce; inverted descending order;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining Workshop (ICDMW), 2014 IEEE International Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-1-4799-4275-6
Type :
conf
DOI :
10.1109/ICDMW.2014.43
Filename :
7022683