Title :
Semi-Supervised Clustering Models for Clinical Risk Assessment
Author :
Yongyang Ho ; Azuaje, Francisco ; McCullagh, Paul ; Harper, Roy
Author_Institution :
Sch. of Comput. & Math., Ulster Univ., Jordanstown
Abstract :
Clustering methods aim to organize a collection of cases into groupings, such that cases within one cluster are more similar to each other than to those in other clusters. A small amount of background knowledge may also be used to guide the clustering process and aid in the interpretation of results. This type of knowledge-driven clustering is known as semi-supervised clustering. The knowledge may be represented by pairwise constraints, labelled cases or known data groupings. Pairwise constraints may be specified, for example, as 'MustLink' or 'CannotLink' associations between cases. This research proposes a semi-supervised clustering method that exploits pairwise constraints and similarity information extracted from constrained cases. The algorithm was first evaluated on publicly-available biomedical datasets. It was then applied to a Type II diabetes dataset to assess coronary heart disease (CHD) complication. This dataset comprises laboratory and physiological information from diabetic patients at the Ulster Hospital (UH) in Northern Ireland. The following methods were compared: traditional k-means, constraint-based k-means with pairwise constraints (CK method) and similarity-driven constraint-based k-means (SCK method). Results showed that the predictive quality, i.e. detection of relevant partitions and significant clusters, on these datasets was improved with a small amount of supervision (i.e. pairwise constraints automatically generated from the predefined class labels). Furthermore, the results from the UH dataset suggest significant associations between clustering outcomes and CHD complication in Type II diabetes patients.
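The idea of pairwise constraints generated from class labels and then enforced during k-means assignment can be illustrated with a minimal sketch. It is not the CK or SCK method from the paper, whose details the abstract does not give. It assumes a generic hard-constraint scheme in which each case goes to the nearest centre that breaks no constraint. The function names (constraints_from_labels, violates, constrained_kmeans) and all parameters are hypothetical.

```python
import random
from itertools import combinations


def constraints_from_labels(labels, n_pairs, seed=0):
    """Sample case pairs and label each as MustLink (same class) or CannotLink."""
    rng = random.Random(seed)
    pairs = rng.sample(list(combinations(range(len(labels)), 2)), n_pairs)
    must = [(i, j) for i, j in pairs if labels[i] == labels[j]]
    cannot = [(i, j) for i, j in pairs if labels[i] != labels[j]]
    return must, cannot


def violates(i, cluster, assign, must, cannot):
    """True if putting case i into `cluster` breaks a constraint with an assigned case."""
    for a, b in must:
        other = b if a == i else a if b == i else None
        if other is not None and assign[other] not in (None, cluster):
            return True
    for a, b in cannot:
        other = b if a == i else a if b == i else None
        if other is not None and assign[other] == cluster:
            return True
    return False


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def constrained_kmeans(points, init_centres, must, cannot, n_iter=10):
    """Assumed hard-constraint k-means: nearest feasible centre, then mean update."""
    centres = list(init_centres)
    k = len(centres)
    assign = [None] * len(points)
    for _ in range(n_iter):
        assign = [None] * len(points)
        for i, p in enumerate(points):
            # Try centres from nearest to farthest; take the first feasible one.
            for c in sorted(range(k), key=lambda c: sq_dist(p, centres[c])):
                if not violates(i, c, assign, must, cannot):
                    assign[i] = c
                    break
            if assign[i] is None:
                raise ValueError(f"no feasible cluster for case {i}")
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign
```

In this sketch a CannotLink constraint can push a case away from its nearest centre, and the greedy assignment may fail with an error if the constraints cannot all be met in the order the cases are visited.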
Keywords :
cardiology; data mining; diseases; knowledge representation; learning (artificial intelligence); medical information systems; pattern clustering; cannotlink associations; clinical risk assessment; constraint-based k-means method; coronary heart disease complication; information extraction; knowledge representation; knowledge-driven clustering; known data groupings; labelled cases groupings; laboratory information; mustlink associations; pairwise constraints; physiological information; predictive quality; publicly-available biomedical datasets; semisupervised clustering model; similarity-driven constraint-based k-means method; traditional k-means method; type II diabetes dataset; Cardiac disease; Clustering algorithms; Clustering methods; Data mining; Databases; Diabetes; Hospitals; Mathematical model; Mathematics; Risk management;
Conference_Title :
BioInformatics and BioEngineering, 2006. BIBE 2006. Sixth IEEE Symposium on
Conference_Location :
Arlington, VA
Print_ISBN :
0-7695-2727-2
DOI :
10.1109/BIBE.2006.253341