Title :
Healthcare Data Analytics Challenge
Author :
Zhiguo Yu;Byron C. Wallace;Todd R. Johnson
Author_Institution :
Sch. of Biomed. Inf. at Houston, Univ. of Texas, Houston, TX, USA
Abstract :
Online patient/caregiver support forums such as cancer compass, ehealthforums, and patientslikeme allow patients and caregivers to post health-related questions. Many of these forums contain a significant volume of repetitive questions. One possible reason for this repetition is that, as forums grow, patients and caregivers do not have the time or patience to read through previous questions before posting their own. The challenge here is to design and implement a system that, for a new question q, identifies a maximum of three existing questions that are most similar to q. In this challenge, we experimented with a variety of methods and representations to address this task, including approaches that leveraged topic modeling, distributional semantics (word2vec), and term frequency-inverse document frequency (TF-IDF) to induce vector representations of questions. For similarity measures, we used cosine similarity and the rescaled dot product over these feature spaces. Despite our experimentation with more recent methods, we found that simple TF-IDF with stemming, combined with cosine similarity, appeared to give the best performance.
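The retrieval setup described in the abstract (TF-IDF over stemmed tokens, cosine similarity, at most three results) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the smoothed IDF formula, and the crude suffix stripper `crude_stem` (a stand-in for a real stemmer such as Porter's) are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import re
from collections import Counter


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    # Lowercase, split on non-alphanumerics, then stem each token.
    return [crude_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(questions):
    # Term counts per question, then a smoothed IDF (an assumed variant).
    docs = [Counter(tokenize(q)) for q in questions]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, questions, k=3):
    # Return up to k (question, score) pairs with nonzero similarity,
    # highest score first. Query terms unseen in the corpus are ignored.
    idf, vectors = build_index(questions)
    counts = Counter(tokenize(query))
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(questions[i], s) for s, i in scored[:k]]
```

Calling `most_similar(new_question, existing_questions)` returns at most three candidates, and fewer when fewer existing questions share any term with the new one. Rebuilding the index on every call keeps the sketch short; a real system would presumably build it once and reuse it.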
Keywords :
"Semantics","Databases","Diabetes","Frequency measurement","Encyclopedias","Electronic publishing"
Conference_Title :
Healthcare Informatics (ICHI), 2015 International Conference on
DOI :
10.1109/ICHI.2015.96