Title of article :
GeoSegmenter: A statistically learned Chinese word segmenter for the geoscience domain
Author/Authors :
Huang, Lan and Du, Youfu and Chen, Gongyang
Issue Information :
Serial issue, year 2015
Abstract :
Unlike English, written Chinese has no spaces between words. Segmenting text into words, known as the Chinese word segmentation (CWS) problem, is therefore a fundamental issue in processing Chinese documents and the first step in many text mining applications, including information retrieval, machine translation and knowledge acquisition. However, for the geoscience subject domain, the CWS problem remains unsolved. Although generic segmenters can be applied to geoscience documents, they lack domain-specific knowledge and consequently their segmentation accuracy drops dramatically.
This motivated us to develop a segmenter specifically for the geoscience subject domain: the GeoSegmenter. We first proposed a generic two-step framework for domain-specific CWS. Following this framework, we built GeoSegmenter using conditional random fields, a principled statistical framework for sequence learning. Specifically, GeoSegmenter first identifies general terms by using a generic baseline segmenter. Then it recognises geoscience terms by learning and applying a model that can transform the initial segmentation into the goal segmentation. Experimental results on geoscience documents and benchmark datasets showed that GeoSegmenter could effectively recognise both geoscience terms and general terms.
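The two-step idea above can be illustrated with a minimal sketch of how an initial segmentation might be turned into training data for a sequence labeller such as a conditional random field. This is not the authors' implementation: the B/M/E/S tagging scheme, the function names (`to_bmes`, `from_bmes`, `make_training_rows`), the feature names and the example sentence are all assumptions chosen for illustration. The sketch only shows the data encoding; training an actual CRF on these rows is left out.

```python
def to_bmes(words):
    """Encode a segmentation as one tag per character:
    B = begin, M = middle, E = end of a multi-character word, S = single-character word."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def from_bmes(chars, tags):
    """Decode per-character tags back into a list of words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


def make_training_rows(initial_words, goal_words):
    """Pair each character with a feature dict (the character plus the tag
    implied by the baseline segmentation) and the label implied by the goal
    segmentation. Feature names here are hypothetical."""
    chars = "".join(initial_words)
    if chars != "".join(goal_words):
        raise ValueError("both segmentations must cover the same text")
    baseline_tags = to_bmes(initial_words)
    goal_tags = to_bmes(goal_words)
    return [
        ({"char": ch, "baseline_tag": bt}, gt)
        for ch, bt, gt in zip(chars, baseline_tags, goal_tags)
    ]


# Hypothetical example: a generic segmenter splits the term 花岗岩 (granite).
initial = ["花岗", "岩", "分布", "广泛"]
goal = ["花岗岩", "分布", "广泛"]

rows = make_training_rows(initial, goal)
for features, label in rows:
    print(features, label)
```

A model trained on such rows would learn, for instance, that a character tagged E by the baseline and followed by 岩 should be relabelled M, and `from_bmes` would then rebuild the corrected word list from the predicted tags.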
Keywords :
Natural language processing , conditional random fields , Chinese word segmentation
Journal title :
Computers & Geosciences