Title :
Identifying and extracting patient smoking status information from clinical narrative texts in Spanish
Author :
Figueroa, Rosa L. ; Soto, Diego A. ; Pino, Esteban J.
Author_Institution :
Univ. of Concepcion, Concepcion, Chile
Abstract :
In this work we present a system to identify and extract patient smoking status from clinical narrative text in Spanish. The clinical narrative text was processed using natural language processing techniques and annotated by four people with a biomedical background. The dataset used for classification contained 2,465 documents, each annotated with one of four smoking status categories. We used two feature representations: single word tokens and bigrams. The classification problem was divided into two levels: the first distinguishes smokers (S) from non-smokers (NS), and the second distinguishes current smokers (CS) from past smokers (PS). For each feature representation and classification level, we used two classifiers: Support Vector Machines (SVM) and Bayesian Networks (BN). We split the dataset into a training set containing 66% of the available documents, used to build the classifiers, and a test set containing the remaining 34%, used to evaluate the models. Our results show that SVM combined with the bigram representation performed best at both classification levels. For the S vs. NS level, the performance measures were: accuracy (ACC)=85%, Precision=85%, and Recall=90%. For the CS vs. PS level, the performance measures were: ACC=87%, Precision=91%, and Recall=94%.
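The pipeline steps named in the abstract (single-word and bigram features, a 66%/34% split, a two-level label scheme, and accuracy/precision/recall) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the random shuffle, and the mapping of CS and PS onto S are assumptions made for illustration, the fourth annotation category is left out because the abstract does not name it, and the SVM and BN classifiers themselves are not shown.

```python
import random


def ngrams(text, n):
    """Return the word n-grams of a text (n=1: single tokens, n=2: bigrams).

    Assumption: lowercasing and whitespace tokenization stand in for the
    unspecified NLP preprocessing.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_66_34(docs, seed=0):
    """Shuffle a copy of the documents and split it 66% train / 34% test."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.66 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Assumption: level 1 merges current (CS) and past (PS) smokers into S.
LEVEL1 = {"CS": "S", "PS": "S", "NS": "NS"}


def level1_label(status):
    """Map a fine-grained status to its level-1 class (S or NS)."""
    return LEVEL1[status]


def metrics(gold, predicted, positive):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

In this sketch, a document labeled S at the first level would then be passed to a second classifier that chooses between CS and PS, and `metrics` would be computed separately for each level on the held-out 34%.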
Keywords :
belief networks; classification; medical administrative data processing; natural language processing; smoke; support vector machines; text analysis; BN; Bayesian Networks; SVM; Spanish; Support Vector Machines; bigrams; classification; clinical narrative texts; current smoker; feature representation; natural language processing techniques; past smoker; patient smoking status information extraction; patient smoking status information identification; single word token; training set; Data mining; Documentation; Feature extraction; Medical diagnostic imaging; Medical services; Natural language processing; Support vector machines;
Conference_Title :
Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE
Conference_Location :
Chicago, IL
DOI :
10.1109/EMBC.2014.6944182