• DocumentCode
    1338799
  • Title
    A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining
  • Author
    Li, Yanpeng ; Hu, Xiaohua ; Lin, Hongfei ; Yang, Zhihao
  • Author_Institution
    Coll. of Comput. Sci. & Technol., Dalian Univ. of Technol., Dalian, China
  • Volume
    8
  • Issue
    2
  • fYear
    2011
  • Firstpage
    294
  • Lastpage
    307
  • Abstract
    Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve 89.1 F-score, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.
  • Keywords
    biology computing; data mining; genetics; learning (artificial intelligence); proteins; AIMED corpus; BioCreative 2; TREC 2005 Genomics Track; biomedical literature mining; class-distinguishing features; example-distinguishing features; feature coupling generalization; gene named entity recognition; gene ontology; protein-protein interaction extraction; semisupervised feature generation; supervised learning; text classification; Bioinformatics; Computational biology; Couplings; Feature extraction; Protein engineering; Proteins; Supervised learning; Feature coupling generalization; biomedical literature mining; named entity recognition; protein-protein interaction extraction; semisupervised learning; text classification; Artificial Intelligence; Data Mining; Molecular Sequence Annotation; PubMed; Terminology as Topic; Vocabulary, Controlled
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2010.99
  • Filename
    5590239