DocumentCode :
3693962
Title :
Design of local web content observatory system
Author :
Gashaw Tsegaye;Solomon Atnafu
Author_Institution :
Department of Computer Science, Addis Ababa University, Addis Ababa, Ethiopia
fYear :
2015
Firstpage :
1
Lastpage :
5
Abstract :
The amount of information on the web is growing rapidly. However, for a particular group or country, it is very difficult to know how much relevant web content is published, in which languages, and on which subjects. Knowing the status of the local web content of a country or culture is critically important for making policy and strategy decisions that support the development of a multilingual and multicultural web. This research therefore designs a model for a local web content observatory system that measures the qualitative and quantitative content of different domains. The system consists of six components: the crawler, content extractor, statistical tracker, language identifier, web document categorizer, and report generator. Although the model is generic and can be applied to any country or culture, to test and evaluate the system we selected all domains hosted under the .et domain. Accordingly, about two thousand seed URLs under the .et domain were used, and the crawler collected around 263,031 web documents. The language identifier achieved an accuracy of 98.67%. To demonstrate the effectiveness of the local web content categorizer, precision, recall, and F-measure tests were conducted. For English documents, the average precision was 91.7%, the recall 97.2%, and the F-measure 94.25%; for Amharic documents, the precision was 91.7%, the recall 87.85%, and the F-measure 86.65%. The average accuracy of the statistical tracker is 98.72%.
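The abstract reports precision, recall, and F-measure for the categorizer but does not state the formulas used. The following is a minimal sketch of the standard definitions, assuming the balanced F-measure (F1, the harmonic mean of precision and recall); the function names and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts for one category."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.
    Assumed here; the abstract does not say which variant was used."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single category, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
f1 = f_measure(p, r)
```

If the reported figures are averages over several categories, the average F-measure need not equal the F-measure computed from the average precision and average recall, so this sketch is not expected to reproduce the abstract's numbers exactly.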
Keywords :
"Crawlers","Observatories","Training","Search engines","Service-oriented architecture","Web pages","Accuracy"
Publisher :
IEEE
Conference_Title :
AFRICON, 2015
Electronic_ISBN :
2153-0033
Type :
conf
DOI :
10.1109/AFRCON.2015.7331964
Filename :
7331964