Title : 
A comparative study of citations and links in document classification
         
        
            Author : 
Couto, Thierson ; Cristo, Marco ; Gonçalves, Marcos André ; Calado, Pável ; Ziviani, Nivio ; Moura, Edleno ; Ribeiro-Neto, Berthier
         
        
            Author_Institution : 
Comput. Sci. Dept., Fed. Univ. of Minas Gerais, Belo Horizonte
         
        
        
        
        
        
            Abstract : 
It is well known that links are an important source of information when dealing with Web collections. However, the question remains whether the same techniques that are used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains of up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in the digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination is based on the notion of classifier reliability and yielded gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes of these results. We discovered that misclassifications by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
         
        
            Keywords : 
Internet; citation analysis; classification; digital libraries; text analysis; Web collection; digital library; document classification; text classification; Computer science; Gain measurement; Humans; Information retrieval; Performance evaluation; Performance gain; Permission; Software libraries; Text categorization; Web pages; links; web directories
         
        
        
        
            Conference_Title : 
Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), 2006
         
        
            Conference_Location : 
Chapel Hill, NC
         
        
            Print_ISBN : 
1-59593-354-9
         
        
        
            DOI : 
10.1145/1141753.1141766