DocumentCode :
2534522
Title :
Categorizing and extracting information from multilingual HTML documents
Author :
Lim, SeungJin ; Ng, Yiu-Kai
Author_Institution :
Dept. of Comput. Sci., Utah State Univ., Logan, UT, USA
fYear :
2005
fDate :
25-27 July 2005
Firstpage :
415
Lastpage :
422
Abstract :
The amount of online information written in different natural languages and the number of non-English-speaking Internet users have increased tremendously during the past decade. To provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that: (i) analyzes, identifies, and categorizes languages used in HTML documents; (ii) extracts information from HTML documents of interest written in different languages; (iii) allows the user to submit queries for retrieving extracted information in the same natural language provided by the query engine of DatAQs using a menu-driven user interface; and (iv) processes the user's queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, and university catalogs. The average F-measure for correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure for categorizing HTML documents belonging to the car-ads application domain is 94%.
Keywords :
Internet; data analysis; hypermedia markup languages; information retrieval; natural languages; Boolean expressions; DatAQs; HTML document identification; data analysis; information categorization; information extraction; information retrieval; menu-driven user interface; multilingual HTML documents; multilingual information; narrow-in-breadth application domains; natural languages; non-English speaking Internet users; online information; querying system; Catalogs; Data analysis; Data mining; HTML; Information analysis; Information retrieval; Internet; Natural languages; Search engines; User interfaces
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
9th International Database Engineering and Application Symposium (IDEAS 2005)
ISSN :
1098-8068
Print_ISBN :
0-7695-2404-4
Type :
conf
DOI :
10.1109/IDEAS.2005.15
Filename :
1540932