DocumentCode
3336399
Title
Information extraction from scientific paper using rhetorical classifier
Author
Khodra, Masayu Leylia ; Widyantoro, Dwi H. ; Aziz, E.A. ; Bambang, Riyanto Trilaksono
Author_Institution
Sch. of Electr. Eng., Bandung Inst. of Technol., Bandung, Indonesia
fYear
2011
fDate
17-19 July 2011
Firstpage
1
Lastpage
5
Abstract
Time constraints often lead readers of a scientific paper to read only its title and abstract, but reading these parts alone is often ineffective. This study aims to extract information automatically so that readers can obtain structured information from a scientific paper. Information extraction is performed by rhetorical classification of each sentence in the paper. Rhetorical information is the intention that the author of the paper wants to convey to the reader. This research used a corpus-based approach to build the rhetorical classifier. Because no suitable rhetorical corpus was available, we constructed our own corpus, a collection of sentences labeled with rhetorical information. Each sentence is represented as a vector of content, location, citation, and meta-discourse features. This collection of feature vectors is used to build rhetorical classifiers with machine learning techniques. Experiments were conducted to select the best learning technique for the rhetorical classifier. The training set consists of 7239 labeled sentences, and the testing set consists of 3638 labeled sentences. We used the WEKA (Waikato Environment for Knowledge Analysis) and LibSVM libraries. The learning techniques considered were Naive Bayes, C4.5, Logistic, Multi-Layer Perceptron, PART, Instance-based Learning, and Support Vector Machines (SVM). The best performers are the SVM and Logistic classifiers, each with an accuracy of 0.51. By applying a one-against-all strategy, the SVM accuracy can be improved to 0.60.
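The one-against-all strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a simple perceptron stands in for the SVM so the example needs no external library, and the labels (`AIM`, `METHOD`, `RESULT`), the four-dimensional toy vectors, and all function names are hypothetical. The idea shown is only the decomposition: train one binary classifier per rhetorical category (that category against all others) and label a sentence with the category whose classifier scores highest.

```python
from typing import Dict, List, Sequence


def linear_score(w: Sequence[float], x: Sequence[float]) -> float:
    """Dot product of the weights with x, plus the bias stored as w[-1]."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]


def train_binary_perceptron(X: List[List[float]], y: List[int],
                            epochs: int = 20) -> List[float]:
    """Train a perceptron on targets in {+1, -1} (a stand-in for a binary SVM)."""
    w = [0.0] * (len(X[0]) + 1)  # last element is the bias
    for _ in range(epochs):
        for x, target in zip(X, y):
            if target * linear_score(w, x) <= 0:  # misclassified or on the boundary
                for i, xi in enumerate(x):
                    w[i] += target * xi
                w[-1] += target
    return w


def train_one_against_all(X: List[List[float]],
                          labels: List[str]) -> Dict[str, List[float]]:
    """Train one binary model per class: that class (+1) against all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        y = [1 if label == cls else -1 for label in labels]
        models[cls] = train_binary_perceptron(X, y)
    return models


def predict(models: Dict[str, List[float]], x: Sequence[float]) -> str:
    """Return the class whose binary model gives the highest score."""
    return max(models, key=lambda cls: linear_score(models[cls], x))


# Hypothetical toy sentence vectors (not the paper's actual features or labels).
X_train = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
y_train = ["AIM", "METHOD", "RESULT"]

models = train_one_against_all(X_train, y_train)
print(predict(models, [0.9, 0.1, 0.0, 0.0]))
```

With a real corpus, each row of `X_train` would be the feature vector of one labeled sentence, and the binary learner would be replaced by whichever classifier is under evaluation.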
Keywords
Bayes methods; information retrieval; learning (artificial intelligence); multilayer perceptrons; pattern classification; support vector machines; C4.5 learning technique; LibSVM libraries; PART; Waikato environment for knowledge analysis; corpus-based approach; feature vectors; information extraction; instance-based learning; logistic learning technique; machine learning techniques; multilayer perceptron; naive Bayes; rhetoric information; rhetorical classifier; scientific paper; sentence rhetorical classification; support vector machines; Accuracy; Data mining; Feature extraction; Logistics; Machine learning; Support vector machines; Training; SVM classifier; information extraction; rhetorical classifier; rhetorical corpus; scientific paper;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical Engineering and Informatics (ICEEI), 2011 International Conference on
Conference_Location
Bandung
ISSN
2155-6822
Print_ISBN
978-1-4577-0753-7
Type
conf
DOI
10.1109/ICEEI.2011.6021634
Filename
6021634