Title :
Delimiting boundaries of a national Web in a globalized world, UAE case study
Author :
BenAbdelkader, Chiraz ; Sanver, Mostafa
Author_Institution :
Sch. of Eng. & Comput. Sci., New York Inst. of Technol., Abu Dhabi, United Arab Emirates
Abstract :
In this paper, we address the problem of delimiting the boundaries of a specific national Web community. We contend that previous simple techniques, mostly based on IP range and language information, are no longer effective. In reality, the Web has undergone a globalization trend, and we can no longer assume a simple one-to-one mapping between where Web content is hosted, the language it is written in, the target community it is intended for, and its geographic location. We propose a two-stage Web page filtering (classification) method for this problem: (1) a pre-crawl filter designed to quickly prune out most of the irrelevant pages without downloading them, and (2) a post-crawl filter that prunes out (most of) the remaining ones via more detailed albeit time-consuming analysis. We discuss the proposed techniques in the context of the UAE national Web, and present results on Web crawl data collected during the period June-July 2010.
Keywords :
Internet; information filtering; UAE; Web page filtering; language information; national Web community; one-to-one mapping; post-crawl filter; pre-crawl filter; time consuming analysis; Communities; Crawlers; IP networks; Web pages; Web server; Graph theory; Hypertext systems; Web characterization
Conference_Title :
GCC Conference and Exhibition (GCC), 2011 IEEE
Conference_Location :
Dubai
Print_ISBN :
978-1-61284-118-2
DOI :
10.1109/IEEEGCC.2011.5752578