Oracle® Data Mining Concepts
11g Release 1 (11.1)

Part Number B28129-01

13 k-Means

This chapter describes the enhanced k-Means clustering algorithm supported by Oracle Data Mining.

See Also:

Chapter 7, "Clustering"

About k-Means

The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is Euclidean, cosine, or fast cosine distance. Each data point is assigned to the nearest cluster according to the distance metric used.
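The following Python sketch illustrates distance-based assignment in general terms. It is not Oracle Data Mining code: the function names are hypothetical, and only the textbook Euclidean and cosine distances are shown (fast cosine is not defined in this chapter, so it is omitted).

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_cluster(point, centroids, distance=euclidean_distance):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

# Illustrative centroids only; a real model learns these from the data.
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(nearest_cluster((1.0, 2.0), centroids))  # index of the first centroid
```

Note that the cosine distance in this sketch is undefined for an all-zero vector, so it should only be used with non-zero points and centroids.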

Oracle Data Mining implements an enhanced version of the k-means algorithm with the following features:

This approach to k-means avoids the need to build multiple k-means models and provides clustering results that are consistently superior to those of traditional k-means.

Scoring k-Means Clustering Models

The clusters discovered by enhanced k-Means are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The k-means algorithm can be interpreted as a mixture model where the mixture components are spherical multivariate normal distributions with the same variance for all components.
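The mixture-model interpretation can be sketched in Python as follows. This is a textbook illustration under stated assumptions, not the scoring formula used by Oracle Data Mining, which this chapter does not give: the function name, the priors, and the shared variance value are all hypothetical.

```python
import math

def cluster_probabilities(point, centroids, priors, variance=1.0):
    """Posterior P(cluster | point) for spherical Gaussians sharing one variance."""
    # Log of prior times Gaussian density, dropping factors common to all clusters.
    log_scores = []
    for centroid, prior in zip(centroids, priors):
        sq_dist = sum((x - m) ** 2 for x, m in zip(point, centroid))
        log_scores.append(math.log(prior) - sq_dist / (2.0 * variance))
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values only (assumed, not taken from any real model).
centroids = [(0.0, 0.0), (4.0, 0.0)]
priors = [0.5, 0.5]
probs = cluster_probabilities((1.0, 0.0), centroids, priors)
print(probs)  # the first cluster receives the higher probability
```

Because every component shares the same variance and, here, the same prior, the most probable cluster is simply the one with the nearest centroid, which is why k-means can be read as a special case of such a mixture.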

Data Preparation for k-Means

The Oracle Data Mining implementation of k-Means supports both categorical and numerical data.

For numerical attributes, data normalization is recommended.
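Normalization matters because an attribute with a large range would otherwise dominate the distance calculation. The sketch below shows min-max normalization in Python as one common choice. The chapter does not say which normalization method to use, so the method, the function name, and the sample values are assumptions.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no distance information; map it to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Two attributes on very different scales end up on the same scale.
income = [20000.0, 50000.0, 80000.0]
age = [25.0, 40.0, 55.0]
print(min_max_normalize(income))  # [0.0, 0.5, 1.0]
print(min_max_normalize(age))     # [0.0, 0.5, 1.0]
```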

For the k-Means algorithm, NULL values indicate sparse data; missing values are not handled automatically. If the data is not sparse and the values are missing at random, you should impute them, substituting a non-NULL value for each NULL value. One simple way to treat missing values is to use the mean for numerical attributes and the mode for categorical attributes. If you do not treat missing values, the algorithm will not handle the data correctly.
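The mean and mode treatment can be sketched in Python as follows. This is a minimal illustration, not an Oracle Data Mining routine: the function name is hypothetical, and None stands in for NULL.

```python
from statistics import mean, mode

def impute(values, categorical=False):
    """Replace None entries with the mode (categorical) or mean (numerical)."""
    present = [v for v in values if v is not None]
    fill = mode(present) if categorical else mean(present)
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0]))
# [1.0, 2.0, 3.0]
print(impute(["red", "blue", None, "red"], categorical=True))
# ['red', 'blue', 'red', 'red']
```

The sketch assumes each column has at least one non-missing value; a column that is entirely None would need separate handling.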

When equi-width binning is used, outliers can prevent k-Means from creating clusters that differ in content. The clusters may have very similar centroids, histograms, and rules.
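One way to see the effect is the Python sketch below, in which a single outlier stretches the bin boundaries so far that all of the ordinary values fall into the same bin. The function name and data are hypothetical, and the binning shown is the generic equal-width scheme, not necessarily the exact procedure used by Oracle Data Mining.

```python
def equi_width_bins(values, num_bins):
    """Assign each value a bin index using bins of equal width over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), num_bins - 1) for v in values]

values = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]  # 1000.0 is an outlier
print(equi_width_bins(values, 4))  # [0, 0, 0, 0, 0, 3]
```

With the outlier present, the five ordinary values become indistinguishable after binning, which leaves the algorithm little information with which to separate them into distinct clusters.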