An NML-based model selection criterion for general relational data modeling

Author

Sakai, Yoshiki; Yamanishi, Kenji

Author_Institution

Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan

fYear

2013

fDate

6-9 Oct. 2013

Firstpage

421

Lastpage

429

Abstract

Whereas the main interest in most existing data mining approaches has been sequence data on a single type of object, namely attribute data, real-world databases store information about multiple relationships between various classes of objects. The modeling of these general relational data (GRD) plays an important role in eliciting knowledge across multiple relations. It is not reasonable to directly apply existing modeling methods to GRD, because GRD have statistical properties that distinguish them from attribute data. In this paper, we address the issue of statistical model selection in GRD modeling. From the viewpoint of the minimum description length principle, we propose a new model selection criterion by considering the statistical properties of GRD. We employ the normalized maximum likelihood code-length as a model selection criterion, and provide an asymptotic expansion theorem for its application to GRD modeling. To demonstrate its use in a critical application, we apply our proposed criterion to the issue of model selection in relational data clustering. An experiment using artificial datasets demonstrates the effectiveness of our technique compared to other criteria, and we also present a brand analysis using real beer-purchase data.
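
For readers unfamiliar with the normalized maximum likelihood (NML) code-length named in the abstract, the following is a minimal sketch of the textbook definition for a plain Bernoulli model: the negative log of the maximized likelihood plus the log of a normalizer summed over all possible data. It is not the criterion proposed in the paper, whose form for GRD is not given in this record; the function names are hypothetical and chosen only for illustration.

```python
from math import comb, log


def bernoulli_max_likelihood(k: int, n: int) -> float:
    """Maximized likelihood of one binary sequence with k ones among n symbols."""
    if k == 0 or k == n:
        return 1.0  # the maximum likelihood estimate puts all mass on the observed symbol
    p = k / n  # maximum likelihood estimate of the success probability
    return p ** k * (1 - p) ** (n - k)


def bernoulli_nml_code_length(k: int, n: int) -> float:
    """NML code-length in nats (assumed textbook form, not the paper's GRD criterion)."""
    # Normalizer: sum of maximized likelihoods over all 2^n sequences,
    # grouped by their count of ones j (there are comb(n, j) sequences per group).
    normalizer = sum(comb(n, j) * bernoulli_max_likelihood(j, n) for j in range(n + 1))
    return -log(bernoulli_max_likelihood(k, n)) + log(normalizer)
```

Under the minimum description length principle, a code-length of this kind would be computed for each candidate model and the model with the shortest one selected. The exact sum above is only tractable for small models, which is presumably why an asymptotic expansion matters for GRD.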

Keywords

maximum likelihood estimation; pattern clustering; relational databases; GRD statistical properties; NML-based model selection criterion; asymptotic expansion theorem; attribute data; data mining approach; general relational data modeling; knowledge elicitation; minimum description length principle; normalized maximum likelihood code-length; real-world databases; relational data clustering; statistical model selection; Approximation methods; Bayes methods; Computational modeling; Data mining; Data models; Probabilistic logic; Stochastic processes; model selection; relational data; stochastic block model

fLanguage

English

Publisher

IEEE

Conference_Title

Big Data, 2013 IEEE International Conference on

Conference_Location

Silicon Valley, CA

Type

conf

DOI

10.1109/BigData.2013.6691603

Filename

6691603