Title :
Text Representations for Text Categorization: A Case Study in Biomedical Domain
Author :
Lan, Man ; Tan, Chew Lim ; Su, Jian ; Low, Hwee Boon
Author_Institution :
Nat. Univ. of Singapore, Singapore
Abstract :
In the vector space model (VSM), textual documents are represented as vectors in the term space. This representation raises two questions: (1) what should a term be, and (2) how should a term be weighted. This paper examines ways to represent text with respect to both questions in order to improve the performance of text categorization. Different representations were evaluated using SVM on three biomedical corpora. The controlled experiments showed that the straightforward use of named entities as terms in the VSM did not improve performance over the bag-of-words representation. The term weighting method, on the other hand, slightly improved performance. Further improvements in text categorization appear to require more advanced techniques and more effective use of natural language processing for text representation.
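The two questions raised in the abstract (what counts as a term, and how a term is weighted) can be illustrated with a minimal sketch. This is not the authors' method. It assumes whitespace-separated lowercase words as terms (a bag-of-words representation) and uses tf-idf as a stand-in weighting scheme, since the abstract does not name the weighting method that was evaluated. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: terms are lowercase whitespace-separated words (bag of words).
    return text.lower().split()


def build_vocabulary(docs):
    # The term space: one dimension per distinct term, in sorted order.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def tf_vector(doc, vocab):
    # Raw term-frequency vector of one document over the term space.
    counts = Counter(tokenize(doc))
    return [float(counts[term]) for term in vocab]


def idf_weights(docs, vocab):
    # Inverse document frequency: log(N / df). Every vocabulary term occurs
    # in at least one document, so df is never zero here.
    n_docs = len(docs)
    token_sets = [set(tokenize(doc)) for doc in docs]
    return [
        math.log(n_docs / sum(1 for s in token_sets if term in s))
        for term in vocab
    ]


def tfidf_vector(doc, vocab, idf):
    # Weighted representation: term frequency scaled by idf, per dimension.
    return [tf * w for tf, w in zip(tf_vector(doc, vocab), idf)]


# Hypothetical toy corpus, for illustration only.
docs = [
    "protein binds protein",
    "gene encodes protein",
    "gene regulates expression",
]
vocab = build_vocabulary(docs)
idf = idf_weights(docs, vocab)
vectors = [tfidf_vector(doc, vocab, idf) for doc in docs]
```

Changing `tokenize` changes what a term is (for example, emitting named entities instead of words), while changing `idf_weights` changes how a term is weighted. The resulting vectors could then be passed to any classifier, such as an SVM.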
Keywords :
classification; medical computing; medical information systems; natural language processing; support vector machines; text analysis; SVM; bag-of-words text representation; biomedical domain; natural language processing; text categorization; textual document representation; vector space model; Data mining; Indexing; Information management; Information retrieval; Local area networks; Natural language processing; Neural networks; Proteins; Support vector machines; Text categorization;
Conference_Title :
Neural Networks, 2007. IJCNN 2007. International Joint Conference on
Conference_Location :
Orlando, FL
Print_ISBN :
978-1-4244-1379-9
Electronic_ISBN :
1098-7576
DOI :
10.1109/IJCNN.2007.4371361