Title :
Better rules, fewer features: a semantic approach to selecting features from text
Author :
Blake, Catherine ; Pratt, Wanda
Author_Institution :
Dept. of Inf. & Comput. Sci., California Univ., Irvine, CA, USA
Abstract :
The choice of features used to represent a domain has a profound effect on the quality of the model produced; yet few researchers have investigated the relationship between the features used to represent text and the quality of the final model. We explored this relationship for medical texts by comparing association rules based on features at three different semantic levels: (1) words, (2) manually assigned keywords, and (3) automatically selected medical concepts. Our preliminary findings indicate that bi-directional association rules based on concepts or keywords are more plausible and more useful than those based on word features. The concept and keyword representations also required 90% fewer features than the word representation. This drastic dimensionality reduction suggests that the approach is well suited to large corpora of medical text, such as parts of the Web.
Keywords :
bibliographic systems; computational linguistics; data mining; medical information systems; text analysis; Web; association rules; automatically selected medical concepts; bi-directional association rules; dimensionality reduction; feature selection; keyword representations; large textual corpus; manually assigned keywords; medical texts; semantic approach; semantic levels; text representation; word features; word representation; words; Association rules; Bidirectional control; Breast cancer; Breast neoplasms; Computer science; Data mining; Diseases; Medical treatment; Natural languages; Predictive models
Conference_Title :
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
Conference_Location :
San Jose, CA
Print_ISBN :
0-7695-1119-8
DOI :
10.1109/ICDM.2001.989501