14 Minimum Description Length

This chapter describes Minimum Description Length, the supervised technique for calculating attribute importance.

About MDL

Minimum Description Length (MDL) is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data. The MDL principle is used to build Oracle Data Mining attribute importance models.

MDL considers each attribute as a simple predictive model of the target class. These single-predictor models are compared and ranked with respect to the MDL metric (compression in bits). MDL penalizes model complexity to avoid overfitting. It is a principled approach that takes into account the complexity of the predictors (as models) to make the comparisons fair.

With MDL, the model selection problem is treated as a communication problem. There is a sender, a receiver, and data to be transmitted. For classification models, the data to be transmitted is a model and the sequence of target class values in the training data.

Attribute importance uses a two-part code to transmit the data. The first part (preamble) transmits the model. The parameters of the model are the target probabilities associated with each value of the predictor. For a target with j values and a predictor with k values, with n_i (i = 1,..., k) rows per value, there are C_i possible conditional probabilities, where C_i is the combination of j-1 things taken n_i-1 at a time. The size of the preamble in bits can be shown to be Sum(log₂(C_i)), where the sum is taken over the k predictor values. Computations like this represent the penalties associated with each single-predictor model. The second part of the code transmits the target values using the model.
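The preamble cost can be illustrated with a small sketch. The definition of C_i above is terse, so this example makes an explicit assumption that is not taken from the product: C_i is the number of ways n_i rows can be split across j target values, that is, the binomial coefficient C(n_i + j - 1, j - 1). The function name preamble_bits is hypothetical, and the exact count used by the product may differ.

```python
import math

def preamble_bits(rows_per_value, num_target_values):
    """Return Sum(log2(C_i)) over the predictor values.

    ASSUMPTION (not from the source text): C_i is the number of ways
    to split n_i rows across j target values, C(n_i + j - 1, j - 1).
    """
    j = num_target_values
    return sum(
        math.log2(math.comb(n_i + j - 1, j - 1))
        for n_i in rows_per_value
    )

# A binary target (j = 2) and a predictor whose values cover 3 and 1 rows.
bits = preamble_bits([3, 1], 2)
```

Under this assumption, a predictor with more distinct values contributes more terms to the sum, which is how the preamble acts as a complexity penalty.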

It is well known that the most compact encoding of a sequence is the encoding that best matches the probability of the symbols (target class values). Thus, the model that assigns the highest probability to the sequence has the smallest target class value transmission cost. In bits, this cost is -Sum(log₂(p_i)), where p_i is the probability that the model predicts for the target value observed in row i.
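As a minimal sketch of the transmission cost, the function below sums the code lengths of the target values, taking the negative of each log₂ probability so that the cost in bits is nonnegative. The name transmission_bits is hypothetical; the input is assumed to be the probability the model assigned to the observed target value in each row, and every probability must be greater than zero.

```python
import math

def transmission_bits(predicted_probs):
    """Bits needed to send the target values given the model.

    Each row costs -log2(p), where p is the probability the model
    assigned to the target value actually observed in that row.
    """
    return -sum(math.log2(p) for p in predicted_probs)

# A confident, correct model needs fewer bits than an uncertain one.
confident = transmission_bits([0.9, 0.9, 0.9])
uncertain = transmission_bits([0.6, 0.6, 0.6])
```

A row predicted with probability 0.5 costs exactly one bit, and a row predicted with probability 0.25 costs two.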

The predictor rank is the position in the list of associated description lengths, smallest first.
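The ranking step can be sketched end to end on toy data. This is an illustration under stated assumptions, not the product's implementation: the preamble uses an assumed count C(n_i + j - 1, j - 1) for C_i, the model probabilities are the empirical target frequencies within each predictor value, and the names description_length and rank_predictors and the sample columns are hypothetical.

```python
import math
from collections import Counter, defaultdict

def description_length(predictor, target):
    """Two-part code length in bits for one single-predictor model."""
    j = len(set(target))                    # number of target values
    counts_by_value = defaultdict(Counter)  # predictor value -> target counts
    for x, y in zip(predictor, target):
        counts_by_value[x][y] += 1

    bits = 0.0
    for counts in counts_by_value.values():
        n_i = sum(counts.values())
        # Part 1 (preamble): ASSUMED count of target splits for n_i rows.
        bits += math.log2(math.comb(n_i + j - 1, j - 1))
        # Part 2: target values coded with the empirical probabilities.
        for c in counts.values():
            bits -= c * math.log2(c / n_i)
    return bits

def rank_predictors(columns, target):
    """Return predictor names ordered by description length, smallest first."""
    lengths = {name: description_length(col, target)
               for name, col in columns.items()}
    return sorted(lengths, key=lengths.get)

target = ["y", "y", "n", "n", "y", "n", "y", "n"]
columns = {
    "noise":   ["a", "b", "a", "b", "b", "a", "a", "b"],
    "row_id":  [1, 2, 3, 4, 5, 6, 7, 8],
    "perfect": ["a", "a", "b", "b", "a", "b", "a", "b"],
}
ranking = rank_predictors(columns, target)  # ['perfect', 'row_id', 'noise']
```

In this toy data, perfect and row_id both determine the target exactly, but row_id has eight distinct values and therefore pays a larger preamble, so it ranks below perfect.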

Data Preparation for MDL

When algorithms such as MDL use equi-width binning, outliers cause most of the data to concentrate in a few bins, sometimes a single bin. As a result, the discriminating power of these algorithms can be significantly reduced.

Similarly, an association model might have all the values of a numerical attribute concentrated in a single bin, except for one value (the outlier) that belongs to a different bin. If, for example, this attribute is income, there will not be any rules reflecting different levels of income. All rules containing income will only reflect the range in the single bin; this range is basically the income range for the whole population.
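The effect of an outlier on equi-width binning can be seen in a short sketch. The income figures are hypothetical (in thousands), and equi_width_counts is an illustrative helper, not a product function.

```python
def equi_width_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # If all values are identical (width 0), use the first bin.
        # The maximum value is clamped into the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

incomes = [30, 35, 40, 45, 50, 55, 60, 65, 70]  # hypothetical incomes
with_outlier = incomes + [10000]                # one extreme value added

spread = equi_width_counts(incomes, 4)          # [2, 2, 2, 3]
collapsed = equi_width_counts(with_outlier, 4)  # [9, 0, 0, 1]
```

With the outlier present, the bin width grows so large that every ordinary income lands in the first bin, and the attribute can no longer distinguish income levels.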