Regional Information Center for Science and Technology - HisTrace: A system for mining on news-related articles instead of web pages

DocumentCode :

2640398

Title :

HisTrace: A system for mining on news-related articles instead of web pages

Author :

Huang, Lianen ; Li, Xiaoming

Author_Institution :

Shenzhen Grad. Sch., Internet Res. & Eng. Center, Peking Univ., Shenzhen, China

fYear :

2010

fDate :

16-17 Aug. 2010

Firstpage :

Lastpage :

Abstract :

The Web now plays an important part in people's real-life activities. Scientists not only in computer science but also in sociology and economics might be interested in mining information directly related to real-life events, that is, news-related information on the Web. In this paper we propose a system that enables mining on news-related articles instead of raw web pages. Functionally, the system has two tasks: 1) mining for news-related articles and 2) duplicate elimination. For the first task, a novel approach for determining the titles, contents and publication-times of news-related articles is presented. Anchor texts are first used to extract titles from HTML bodies, and contents are then extracted right after the titles. After that, crawl-times and are used to initially compute publication-times for all articles. Finally, times extracted from HTML bodies, URLs and anchor texts are used to determine precise publication-times for possible articles. For the second task, a duplicate detection algorithm for news-related articles is described, which is based on LCS (longest common subsequence) and achieves both high precision and high recall. The framework of this algorithm was presented as a general-purpose algorithm for web pages in a previously published paper. In this paper we explain why the algorithm is particularly suitable for news-related articles and present the corresponding implementation details. Evaluations have been conducted which show the effectiveness of our approaches.
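
To illustrate the LCS idea mentioned in the abstract, here is a minimal sketch of scoring two article bodies by their longest common subsequence. It is not the authors' algorithm, whose details are not given in this record. The function names, the word-level tokenization, the normalization by the longer text, and the 0.8 threshold are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two sequences."""
    # prev[j] holds the LCS length of the items of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(text_a, text_b):
    """Score two texts in [0, 1] by word-level LCS (hypothetical measure)."""
    words_a, words_b = text_a.split(), text_b.split()
    if not words_a or not words_b:
        return 0.0
    # Assumption: normalize by the longer text so that a short excerpt
    # does not count as a full duplicate of a long article.
    return lcs_length(words_a, words_b) / max(len(words_a), len(words_b))


def is_duplicate(text_a, text_b, threshold=0.8):
    """Flag a pair as duplicates when similarity reaches the assumed threshold."""
    return lcs_similarity(text_a, text_b) >= threshold
```

The dynamic program takes time proportional to the product of the two text lengths, so a real system would probably filter candidate pairs cheaply before comparing them this way.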

Keywords :

Internet; data mining; text analysis; HTML bodies; HisTrace; URLs; Web pages; anchor texts; content determination; duplicate detection algorithm; duplicate elimination; economics; longest common subsequence; news-related article mining; publication-time determination; sociology; time extraction; title determination; title extraction; Algorithm design and analysis; Crawlers; Data mining; Feature extraction; HTML; Search engines; Web pages; Duplicate Detection; News-related Articles; Publication Time; Web Mining;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Society (SWS), 2010 IEEE 2nd Symposium on

Conference_Location :

Beijing

Print_ISBN :

978-1-4244-6356-5

Type :

conf

DOI :

10.1109/SWS.2010.5607481

Filename :

5607481

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2640398