DocumentCode :
3088690
Title :
Web image size prediction for efficient focused image crawling
Author :
Andreadou, Katerina ; Papadopoulos, Symeon ; Kompatsiaris, Yiannis
Author_Institution :
Inf. Technol. Inst. (ITI), Centre for Res. & Technol. Hellas (CERTH), Thessaloniki, Greece
fYear :
2015
fDate :
10-12 June 2015
Firstpage :
1
Lastpage :
6
Abstract :
Using Web image content for analysis and retrieval typically requires large-scale image crawling. A serious bottleneck in such set-ups is fetching the image content, since for each web page a large number of HTTP requests must be issued to download all included image elements. In practice, however, only relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Because the HTML img tag often carries no dimension information, an image crawler that wants to filter out small images would still need to issue a GET request and download the respective files before deciding whether to index them. To address this limitation, we explore the challenge of predicting the size of images on the Web based only on their URL and on information extracted from the surrounding HTML code. We present two methodologies: the first is based on a common text classification approach using the n-grams or tokens of the image URLs, and the second relies on the HTML elements surrounding the image. Finally, we combine the two techniques and achieve considerable improvement in accuracy, leading to a highly effective filtering component that can significantly improve the speed and efficiency of the image crawler.
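The first methodology in the abstract (text classification over URL n-grams or tokens) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the classifier, the n-gram length, or the training data, so the multinomial Naive Bayes model, the trigram setting, the helper names (`url_ngrams`, `url_tokens`, `NaiveBayes`), and the toy `example.com` URLs are all hypothetical and not the authors' implementation.

```python
import math
import re
from collections import Counter, defaultdict


def url_ngrams(url, n=3):
    """Return the character n-grams of a lower-cased URL."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_tokens(url):
    """Split a lower-cased URL into alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative choice)."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.class_counts = Counter()               # label -> sample count
        self.vocab = set()

    def fit(self, samples, labels):
        for feats, label in zip(samples, labels):
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in feats:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set: URLs labelled by image size class.
train = [
    ("http://example.com/images/icons/arrow_16x16.png", "small"),
    ("http://example.com/static/button_small.gif", "small"),
    ("http://example.com/thumbs/logo_icon.png", "small"),
    ("http://example.com/uploads/2015/06/photo_large.jpg", "big"),
    ("http://example.com/gallery/full/landscape_1024.jpg", "big"),
    ("http://example.com/media/photos/article_large.jpg", "big"),
]

model = NaiveBayes().fit(
    [url_ngrams(u) for u, _ in train],
    [label for _, label in train],
)

# A crawler could skip the GET request when the predicted class is "small".
print(model.predict(url_ngrams("http://example.com/icons/button_16x16.png")))
```

Swapping `url_ngrams` for `url_tokens` gives the token-based variant mentioned in the abstract. Features from the surrounding HTML (the second methodology) would need a separate extractor and are not sketched here.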
Keywords :
Internet; Web sites; hypermedia markup languages; image retrieval; text analysis; HTML code; HTML element; HTTP request; Web image content; Web image size prediction; Web page; decorative element; dimension information; filtering component; image URL; image crawler; image element; information extraction; large-scale image crawling; text classification; Data mining; Feature extraction; HTML; Training; Uniform resource locators; Vegetation; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Content-Based Multimedia Indexing (CBMI), 2015 13th International Workshop on
Conference_Location :
Prague
Type :
conf
DOI :
10.1109/CBMI.2015.7153609
Filename :
7153609