Regional Information Center for Science and Technology - NLP based intelligent news search engine using information extraction from e-newspapers

DocumentCode :

3579296

Title :

NLP based intelligent news search engine using information extraction from e-newspapers

Author :

Kanakaraj, Monisha ; Kamath, S.Sowmya

Author_Institution :

Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Mangalore. India

fYear :

2014

Firstpage :

Lastpage :

Abstract :

Extracting text information from a web news page is a challenging task, as most e-news content is provided through backend Content Management Systems (CMSs). In this paper, we present a personalized news search engine that builds a repository of news articles by efficiently extracting text information from web news pages across varied e-news portals. The system is based on Document Object Model (DOM) tree manipulation, which extracts text and modifies the web page structure to exclude irrelevant content such as ads and user comments. We also use WordNet, a thesaurus of the English language based on psycholinguistic studies, to match the extracted content semantically to the title of the web page. TF-IDF (Term Frequency-Inverse Document Frequency) is used to identify the web page blocks carrying information relevant to the page's title. In addition to information extraction, the system includes functionality to gather related information from different web newspapers and to summarize the gathered information based on user preferences. We observed that the system achieved good recall and high precision for both generalized and specific queries.
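
The abstract names TF-IDF as the way blocks relevant to the page title are identified, but this record does not give the authors' implementation. The following is only a minimal sketch of that general idea, under stated assumptions: the page has already been split into plain-text blocks (the DOM parsing and the WordNet matching steps are omitted), TF-IDF is computed over the title plus the blocks of a single page, and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_blocks` are hypothetical, chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_blocks(title, blocks):
    """Score each text block against the title, highest score first."""
    docs = [tokenize(title)] + [tokenize(block) for block in blocks]
    vectors = tfidf_vectors(docs)
    title_vec = vectors[0]
    scored = [(cosine(title_vec, vec), block)
              for vec, block in zip(vectors[1:], blocks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Made-up example data, not taken from the paper.
    title = "Flood warning issued for coastal towns"
    blocks = [
        "Subscribe now and save on premium deals",
        "Officials issued a flood warning as coastal towns prepared for heavy rain",
        "Great comment, thanks for sharing",
    ]
    for score, block in rank_blocks(title, blocks):
        print(f"{score:.3f}  {block}")
```

In this sketch, blocks that share no terms with the title score zero, so a threshold on the score could be used to drop ads and comment sections. The cut-off value and the tokenization rule are design choices the record does not specify.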

Keywords :

Data mining; HTML; Noise; Search engines; Semantics; Web pages; NLP; Summary generation; Text Extraction; information retrieval; search engine;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computational Intelligence and Computing Research (ICCIC), 2014 IEEE International Conference on

Print_ISBN :

978-1-4799-3974-9

Type :

conf

DOI :

10.1109/ICCIC.2014.7238500

Filename :

7238500

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3579296