DocumentCode
2392433
Title
Towards automatic Web genre identification: a corpus-based approach in the domain of academia by example of the Academic's Personal Homepage
Author
Rehm, Georg
Author_Institution
Res. Unit for Appl. & Computational Linguistics, Justus-Liebig-Universität, Giessen, Germany
fYear
2002
fDate
7-10 Jan. 2002
Firstpage
1143
Lastpage
1152
Abstract
We argue for a systematic analysis of one particular, well-structured domain (academic Web pages) with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3,000,000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type, which constitutes the basic framework for a certain Web genre, and of compulsory and optional Web genre modules. These act as building blocks which go together to make up the structure characterised by the Web genre type and, furthermore, operate as modifiers for the default assignment involved. The analysis of a 200-document sample illustrates our notion of Web genre hierarchy, into which Web genre types and modules are embedded. The analysis of four different documents of the Web genre Academic's Personal Homepage illustrates not only our approach, but also our long-term goal of automatically extracting the contents of Web genre modules in order to build up structured XML documents out of groups of unstructured HTML documents.
Keywords
hypermedia markup languages; information resources; Academic Personal Homepage; German language; HTML documents; Web genre hierarchy; Web genre type; academic Web pages; compulsory modules; database-driven system; optional modules; structured XML documents; unstructured HTML documents; Computational linguistics; Data mining; Databases; Educational institutions; HTML; Search engines; Tagging; Web pages; Web sites; XML
fLanguage
English
Publisher
IEEE
Conference_Titel
System Sciences, 2002. HICSS. Proceedings of the 35th Annual Hawaii International Conference on
Print_ISBN
0-7695-1435-9
Type
conf
DOI
10.1109/HICSS.2002.994036
Filename
994036