• DocumentCode
    35987
  • Title

    Semantic Similarity Measures in the Biomedical Domain by Leveraging a Web Search Engine

  • Author

    Sheau-Ling Hsieh ; Wen-Yung Chang ; Chi-Huang Chen ; Yung-Ching Weng

  • Author_Institution
    Nat. Chiao Tung Univ., Hsinchu, Taiwan
  • Volume
    17
  • Issue
    4
  • fYear
    2013
  • fDate
    Jul-13
  • Firstpage
    853
  • Lastpage
    861
  • Abstract
    Various web-based semantic similarity measures have been proposed. However, measuring the semantic similarity between two terms remains a challenging task. Traditional ontology-based methodologies are limited in that both concepts must reside in the same ontology tree(s). In practice, this assumption does not always hold. Corpus-based methodologies, on the other hand, can overcome the limitation if the corpus is adequate, and the web is an enormous, continuously growing corpus. Therefore, a method of estimating semantic similarity is proposed that exploits the page counts returned by the Google AJAX web search engine for two biomedical concepts. The features are extracted as the co-occurrence patterns of two given terms P and Q, obtained by querying P, Q, and P AND Q, together with the web search hit counts of the defined lexico-syntactic patterns. The similarity scores of the different patterns are evaluated, by adapting support vector machines for classification, to make the semantic similarity measure more robust. Experimental results, validated against two datasets (dataset 1 provided by A. Hliaoutakis and dataset 2 provided by T. Pedersen), are presented and discussed. In dataset 1, the proposed approach achieves the best correlation coefficient (0.802) under SNOMED-CT. In dataset 2, the proposed method obtains the best correlation coefficient with physician scores (SNOMED-CT: 0.705; MeSH: 0.723) compared with the other methods. However, its correlation coefficients with coder scores (SNOMED-CT: 0.496; MeSH: 0.539) show the opposite outcome. In conclusion, the semantic similarity findings of the proposed method are close to the physicians' ratings. Furthermore, the study provides a cornerstone investigation for extracting fully relevant information from digitized, free-text medical records in the National Taiwan University Hospital database.
  • Keywords
    Internet; feature extraction; medical computing; medical information systems; pattern classification; search engines; support vector machines; Google AJAX Web search engine; National Taiwan University Hospital database; SNOMED-CT; Web related semantic similarity measures; biomedical domain; correlation coefficient; free-text medical records; lexico-syntactic pattern; page counts; Semantic similarity; corpus-based; page-count-based; support vector machine; web search engine
  • fLanguage
    English
  • Journal_Title
    Biomedical and Health Informatics, IEEE Journal of
  • Publisher
    IEEE
  • ISSN
    2168-2194
  • Type

    jour

  • DOI
    10.1109/JBHI.2013.2257815
  • Filename
    6508818
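
The abstract describes features built from the page counts of the queries P, Q, and P AND Q. A minimal sketch of how such counts might be turned into co-occurrence scores is given below. The specific measures (Jaccard, Dice, overlap, and pointwise mutual information), the function names, and the assumed total number of indexed pages `n` are illustrative assumptions; the abstract does not state which formulas the authors use, and no search engine is actually queried here.

```python
import math


def web_jaccard(p, q, pq):
    """Jaccard-style score from page counts of P, Q, and 'P AND Q'."""
    denom = p + q - pq
    # Hit counts are estimates and can be inconsistent, so guard the denominator.
    if pq == 0 or denom <= 0:
        return 0.0
    return pq / denom


def web_dice(p, q, pq):
    """Dice-style score: twice the joint count over the sum of single counts."""
    if pq == 0 or p + q == 0:
        return 0.0
    return 2 * pq / (p + q)


def web_overlap(p, q, pq):
    """Overlap-style score: joint count over the smaller single count."""
    smaller = min(p, q)
    if pq == 0 or smaller == 0:
        return 0.0
    return pq / smaller


def web_pmi(p, q, pq, n):
    """Pointwise mutual information (base 2); n is an assumed index size."""
    if pq == 0 or p == 0 or q == 0:
        return 0.0
    return math.log2((pq / n) / ((p / n) * (q / n)))


def page_count_features(p, q, pq, n=1e10):
    """Collect the four scores into one feature vector (hypothetical layout)."""
    return [
        web_jaccard(p, q, pq),
        web_dice(p, q, pq),
        web_overlap(p, q, pq),
        web_pmi(p, q, pq, n),
    ]
```

A vector such as `page_count_features(100, 50, 25, n=1000)` could then be combined with pattern hit counts and passed to a classifier such as a support vector machine, which is the role the abstract assigns to the SVM.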