DocumentCode
492502
Title
A Novel POS-Based Approach to Chinese News Topic Extraction from Internet
Author
Zhao, Xujian ; Jin, Peiquan ; Yue, Lihua
Author_Institution
Dept. of Comput. Sci. & Technol., Univ. of Sci. & Technol. of China
Volume
2
fYear
2008
fDate
13-15 Dec. 2008
Firstpage
39
Lastpage
42
Abstract
News topic extraction is important for news search engines. Traditional methods are based on pattern matching and linguistic analysis, which depend mainly on measuring feature similarity. These methods are inefficient for extracting Chinese news topics from the Internet for two reasons: natural language processing (NLP) for Chinese is difficult, and Internet news is diverse and updated quickly. Some recent work exploits the special structure of news (e.g., the title) to extract Chinese news topics. However, two problems remain unsolved: (1) some news topics are missed, and (2) irregular topic words are produced. To solve these two problems, we propose a part-of-speech (POS) based approach to news topic extraction. We first segment the news title into words and tag each word with its POS, then eliminate segmentation errors according to POS information and positional relations. Next, topic words are associated and combined into larger ones, and different topic weights are assigned to the combined words. We conduct an experiment on 600 Chinese news Web pages to evaluate the approach. The experimental results show that our approach achieves higher recall and precision in news topic extraction and clearly reduces the number of irregular topic words.
Keywords
Internet; information resources; information retrieval; natural language processing; search engines; word processing; Chinese news topic extraction; news search engine; novel POS-approach; topic word segmentation; Computer science; Conferences; Data mining; IP networks; Pattern analysis; Pattern matching; Thesauri; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Future Generation Communication and Networking Symposia, 2008. FGCNS '08. Second International Conference on
Conference_Location
Sanya
Print_ISBN
978-1-4244-3430-5
Electronic_ISBN
978-0-7695-3546-3
Type
conf
DOI
10.1109/FGCNS.2008.71
Filename
4813517