Author/Authors :
Erik Thorlund Jepsen, Piet Seiden, Peter Ingwersen, Lennart Björneborn, and Pia Borlund
Abstract :
Because of the increasing presence of scientific publications
on the Web, combined with the existing difficulties in
easily verifying and retrieving these publications, research
on techniques and methods for retrieval of scientific Web
publications is called for. In this article, we report
on the initial steps taken toward the construction of a test
collection of scientific Web publications within the subject
domain of plant biology. The steps reported are those of
data gathering and data analysis aiming at identifying
characteristics of scientific Web publications. The data
used in this article were generated based on specifically
selected domain topics that are searched for in three publicly
accessible search engines (Google, AllTheWeb, and
AltaVista). A sample of the retrieved hits was analyzed with
regard to how various publication attributes correlated
with the scientific quality of the content and whether this
information could be employed to harvest, filter, and rank
Web publications. The attributes analyzed were inlinks,
outlinks, bibliographic references, file format, language,
search engine overlap, structural position (according to
site structure), and the occurrence of various types of
metadata. As could be expected, the ranked output differs
between the three search engines. Apparently, this is
caused by differences in ranking algorithms rather than
the databases themselves. In fact, because scientific Web
content in this subject domain receives few inlinks, both
AltaVista and AllTheWeb retrieved a higher degree of accessible
scientific content than Google. Because of the
search engine cutoffs of accessible URLs, the feasibility of
using search engine output for Web content analysis is
also discussed.