Abstract:
This paper considers model selection in classification. In many applications, such as pattern recognition, probabilistic inference using a Bayesian network, and prediction of the next element in a sequence based on a Markov chain, the conditional probability P(Y=y|X=x) of class y ∈ Y given attribute value x ∈ X is used. By a model we mean an equivalence relation on X: for x, x′ ∈ X, x ~ x′ ⇔ P(Y=y|X=x) = P(Y=y|X=x′) for all y ∈ Y. By classification we mean that the number of such equivalence classes is finite. We estimate the model from n samples z^n = (x_i, y_i)_{i=1..n} ∈ (X × Y)^n using an information criterion of the form empirical entropy H plus penalty term (k/2)d_n (the model minimizing H + (k/2)d_n is the estimated model), where k is the number of independent parameters in the model and {d_n}_{n≥1} is a nonnegative real sequence such that lim sup_{n→∞} d_n/n = 0. For autoregressive processes, although the definitions of H and k differ, it is known that the estimated model almost surely coincides with the true model as n → ∞ if {d_n}_{n≥1} > {2 log log n}_{n≥1}, and that it does not if {d_n}_{n≥1} < {2 log log n}_{n≥1} (Hannan and Quinn). Whether the same property holds for classification was an open problem. This paper solves that problem in the affirmative.
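As a rough illustration of the selection rule described above, the following Python sketch evaluates H + (k/2)d_n for a set of candidate models and returns the minimizer, with the Hannan–Quinn-type penalty d_n = 2 log log n. The function names (`empirical_entropy`, `select_model`) and the representation of a model as a dict mapping each attribute value to its equivalence class are assumptions made for illustration, not constructions taken from the paper.

```python
import math
from collections import Counter

def empirical_entropy(samples, partition):
    """Negative maximized log-likelihood (nats) of y given the equivalence
    class of x, i.e. n times the empirical conditional entropy H."""
    joint = Counter((partition[x], y) for x, y in samples)
    marg = Counter(partition[x] for x, _ in samples)
    return -sum(c * math.log(c / marg[cls]) for (cls, _), c in joint.items())

def information_criterion(samples, partition, num_labels, d_n):
    """H + (k/2) d_n, with k = (#equivalence classes) * (|Y| - 1)."""
    k = len(set(partition.values())) * (num_labels - 1)
    return empirical_entropy(samples, partition) + 0.5 * k * d_n

def select_model(samples, candidate_partitions, num_labels):
    """Return the candidate partition minimizing the criterion,
    using the Hannan-Quinn-type penalty d_n = 2 log log n (n >= 3)."""
    n = len(samples)
    d_n = 2.0 * math.log(math.log(n))
    return min(candidate_partitions,
               key=lambda p: information_criterion(samples, p, num_labels, d_n))
```

For example, with X = {0, 1, 2} and Y = {0, 1}, one candidate partition might map 0 and 1 to the same class and 2 to another, while a coarser candidate maps all attribute values to a single class; `select_model` would compare their penalized empirical entropies and keep the smaller one.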
Keywords:
Markov processes; autoregressive processes; belief networks; entropy; pattern classification; probability; sequences; Bayesian network; Markov chain; classification; conditional probability; empirical entropy; information criteria; model selection; strong consistency; artificial intelligence; Bayesian methods; intelligent networks; mathematics; pattern recognition; random variables; statistical learning; statistics; error probability; Hannan and Quinn's procedure; Kullback–Leibler divergence; law of the iterated logarithm