DocumentCode :
3107214
Title :
Specific Web Spider Design for the Extraction of Unknown Chinese Words from BBS Corpus
Author :
Xiong, Hai-ling ; Du, Jing
Author_Institution :
Coll. of Comput. & Inf. Sci., Southwest Univ., Chongqing, China
fYear :
2009
fDate :
13-14 Dec. 2009
Firstpage :
499
Lastpage :
502
Abstract :
To address the low efficiency of unknown-word segmentation in Chinese, this paper presents an improved Web spider design that extracts text from the TianYa BBS in order to construct a better corpus. Unknown words are then extracted from the corpus with a new function constructed from a mutual information function and a duplicated combination frequency function. Experiments showed that the improved method was more efficient.
Keywords :
Internet; data mining; natural language processing; BBS corpus; Chinese word extraction; Web spider design; duplicated combination frequency function; mutual information function; words segmentation; Conference management; Data mining; Design engineering; Engineering management; Frequency; Information management; Information technology; Mutual information; Technology management; Uniform resource locators; BBS; Chinese word segmentation; corpus; unknown word; web spider
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Future Information Technology and Management Engineering, 2009. FITME '09. Second International Conference on
Conference_Location :
Sanya
Print_ISBN :
978-1-4244-5339-9
Type :
conf
DOI :
10.1109/FITME.2009.130
Filename :
5381036