DocumentCode :
3523574
Title :
Automatic document metadata extraction using support vector machines
Author :
Han, Hui ; Giles, C. Lee ; Manavoglu, Eren ; Zha, Hongyuan ; Zhang, Zhenyue ; Fox, Edward A.
Author_Institution :
Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
fYear :
2003
fDate :
27-31 May 2003
Firstpage :
37
Lastpage :
48
Abstract :
Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a support vector machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovering and using the structural patterns of the data, together with domain-based word clustering, can improve metadata extraction performance. Appropriate feature normalization also greatly improves classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [S. Lawrence et al., (1999)] and EbizSearch [Y. Petinot et al., (2003)]. We believe it can be generalized to other digital libraries.
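The iterative step described in the abstract, where each line is re-classified using the labels its neighbors received in the previous round, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the label names, and the rule-based `toy_classify` (a stand-in for a trained support vector machine) are all hypothetical, and the sketch assigns a single label per line although the paper allows more than one.

```python
from typing import Callable, List, Optional

Label = str
# Hypothetical classifier signature (an assumption for this sketch):
# (line text, previous line's label, next line's label) -> predicted label
Classifier = Callable[[str, Optional[Label], Optional[Label]], Label]


def iterate_line_labels(lines: List[str], classify: Classifier,
                        max_rounds: int = 10) -> List[Label]:
    """Re-classify lines with neighbor labels until the labels stop changing."""
    # Round 0: classify every line independently, with no neighbor context.
    labels = [classify(line, None, None) for line in lines]
    for _ in range(max_rounds):
        new_labels = []
        for i, line in enumerate(lines):
            # Neighbor labels always come from the previous round.
            prev_label = labels[i - 1] if i > 0 else None
            next_label = labels[i + 1] if i + 1 < len(lines) else None
            new_labels.append(classify(line, prev_label, next_label))
        if new_labels == labels:  # converged: a full round changed nothing
            break
        labels = new_labels
    return labels


def toy_classify(line: str, prev_label: Optional[Label],
                 next_label: Optional[Label]) -> Label:
    """Rule-based stand-in for a trained classifier, for illustration only."""
    lowered = line.lower()
    if "@" in line:
        return "email"
    if "university" in lowered or "dept" in lowered:
        return "affiliation"
    # Context rule: a line between a title and an affiliation is an author line.
    if prev_label == "title" and next_label == "affiliation":
        return "author"
    return "title"


# Made-up header lines, used only to exercise the sketch.
header = [
    "Automatic document metadata extraction",
    "Hui Han",
    "Pennsylvania State University",
    "author@example.org",
]
result = iterate_line_labels(header, toy_classify)
```

In this toy run the second line is first labeled as a title, because no context is available, and is relabeled as an author line once its neighbors' labels from the previous round can be used.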
Keywords :
digital libraries; information retrieval; iterative methods; learning (artificial intelligence); learning automata; meta data; pattern classification; pattern clustering; text analysis; automatic document metadata extraction; data structural pattern; digital library; domain based word clustering; feature normalization; iterative convergence procedure; machine learning method; support vector machine classification-based method; Convergence; Data mining; Design methodology; Learning systems; Robustness; Scalability; Software libraries; Support vector machine classification; Support vector machines; Usability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Digital Libraries, 2003. Proceedings. 2003 Joint Conference on
Print_ISBN :
0-7695-1939-3
Type :
conf
DOI :
10.1109/JCDL.2003.1204842
Filename :
1204842