A mixture language model for class-attribute mining from biomedical literature digital library

Author

Zhou, Xiaohua ; Hu, Xiaohua ; Zhang, Xiaodan ; Wu, Daniel D.

Author_Institution

Data Min. & Bioinf. Lab., Drexel Univ., Philadelphia, PA

fYear

2007

fDate

2-4 Nov. 2007

Firstpage

174

Lastpage

182

Abstract

We define and study a novel text mining problem for biomedical literature digital libraries, referred to as class-attribute mining. Given a collection of biomedical literature from a digital library addressing a set of objects (e.g., proteins) and their descriptions (e.g., protein functions), the tasks of class-attribute mining include: (1) to identify and summarize latent classes in the space of objects, (2) to discover latent attribute themes in the space of object descriptions, and (3) to summarize the commonalities and differences among the identified classes along each attribute theme. We approach this mining problem through a mixture language model and estimate the model's parameters using the EM algorithm. We demonstrate the effectiveness of the model with an application called protein community identification and annotation from Medline, the largest biomedical literature digital library, with more than 16 million abstracts.

Keywords

bibliographic systems; data mining; digital libraries; expectation-maximisation algorithm; medical information systems; EM algorithm; Medline; biomedical literature digital library; class-attribute text mining; latent attribute theme; mixture language model; object description; protein community identification; Abstracts; Bioinformatics; Context modeling; Data mining; Educational institutions; Information science; Parameter estimation; Proteins; Software libraries; Text mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Bioinformatics and Biomedicine Workshops, 2007. BIBMW 2007. IEEE International Conference on

Conference_Location

Fremont, CA

Print_ISBN

978-1-4244-1604-2

Type

conf

DOI

10.1109/BIBMW.2007.4425416

Filename

4425416