An Algorithm for Classifying Articles and Patent Documents Using Link Structure

Author

Indukuri, Kishore Varma ; Mirajkar, Pranav ; Sureka, Ashish

Author_Institution

SET Labs., Infosys Technol. Ltd., Bangalore

fYear

2008

fDate

20-22 July 2008

Firstpage

203

Lastpage

210

Abstract

The link structure of the World Wide Web (WWW) has attracted a lot of research interest, and several papers have been published on the structural analysis of hyperlinked environments such as the WWW. The WWW can be modeled as a graph, and valuable information can be derived by analyzing the links between Web pages, primarily for the purpose of building better search engines. Many novel methods have been presented to discover communities and authoritative Web pages on the WWW. Citation analysis is a well-researched branch of information science. It pertains to the analysis of citations among articles and research papers in a scholarly field and to deriving useful information from them. It has primarily been used as a tool to quantify and judge the impact of a paper or a journal. The work presented in this paper lies at the intersection of these two fields: structural analysis of the WWW and citation analysis. In this paper, we present a method for classifying documents (such as articles and patents containing references) into a class or topic based on their link structure, references and citations. The method first analyzes the link structure of a corpus to identify authoritative papers, which are then assigned class labels. The class labels are assigned manually by a domain expert who reads the respective documents. The next step identifies papers related to the authoritative papers using citation analysis. The authoritative papers, their class labels and their related papers constitute a model. Papers whose class label needs to be determined are classified based on this model.
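
The pipeline outlined in the abstract (find authoritative papers, label them by hand, gather related papers, classify new papers against the resulting model) can be illustrated with a minimal sketch. The abstract does not say which authority measure or relatedness measure the authors use, so everything below is an assumption made for illustration: authority is approximated by in-degree (times cited), "related" means "cites the authoritative paper", and a new paper is scored by how many of its references fall inside each class. The function names, paper ids and class labels are all hypothetical.

```python
from collections import Counter


def find_authorities(citations, top_k):
    """Return the top_k most-cited paper ids.

    Assumption: in-degree is used as a simple stand-in for an
    authority score; the paper may use a different measure.
    """
    in_degree = Counter(ref for refs in citations.values() for ref in refs)
    # Sort by descending citation count, then by id for determinism.
    ranked = sorted(in_degree, key=lambda p: (-in_degree[p], p))
    return ranked[:top_k]


def build_model(citations, expert_labels):
    """Map each class label to its authoritative papers plus related papers.

    Assumption: a paper is "related" to an authority if it cites it.
    expert_labels stands in for the manual labeling by a domain expert.
    """
    model = {}
    for authority, label in expert_labels.items():
        related = {p for p, refs in citations.items() if authority in refs}
        model.setdefault(label, set()).update(related | {authority})
    return model


def classify(references, model):
    """Pick the class whose members overlap most with the given references.

    Returns None when no reference falls inside any class.
    """
    scores = {label: len(set(references) & members)
              for label, members in model.items()}
    best = max(sorted(scores), key=lambda label: scores[label], default=None)
    if best is None or scores[best] == 0:
        return None
    return best


# Hypothetical toy corpus: paper id -> list of cited paper ids.
citations = {
    "p1": ["a1"],
    "p2": ["a1"],
    "p3": ["a1", "a2"],
    "p4": ["a2"],
    "a1": [],
    "a2": [],
}

authorities = find_authorities(citations, top_k=2)        # ['a1', 'a2']
expert_labels = {"a1": "databases", "a2": "networks"}     # manual step
model = build_model(citations, expert_labels)

print(classify(["a1", "p2", "p4"], model))  # databases
print(classify(["x"], model))               # None
```

In this toy corpus the unlabeled paper shares two references with the "databases" class and one with "networks", so it is assigned to "databases". Other relatedness measures, such as co-citation or bibliographic coupling, could be swapped into `build_model` without changing the rest of the sketch.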

Keywords

Internet; citation analysis; classification; document handling; graph theory; information science; search engines; World Wide Web; articles classification; authoritative Web pages; graph; hyperlinked environments; link structure; patent document classification; structural analysis; Association rules; Cities and towns; Data mining; Government; Information analysis; Information management; Portfolios; Predictive models; Bibliography Coupling; Citation graph; Co-citation; Document Similarity; Link Topology; Text Mining; Web Community

fLanguage

English

Publisher

IEEE

Conference_Titel

Web-Age Information Management, 2008. WAIM '08. The Ninth International Conference on

Conference_Location

Zhangjiajie, Hunan

Print_ISBN

978-0-7695-3185-4

Electronic_ISBN

978-0-7695-3185-4

Type

conf

DOI

10.1109/WAIM.2008.31

Filename

4597015