Oracle® Data Mining Concepts 11g Release 1 (11.1) Part Number B28129-01


This chapter describes clustering, the unsupervised mining function for discovering natural groupings within the data.
See Also:
"Unsupervised Data Mining"This chapter includes the following topics:
A cluster is a collection of data objects that are similar in some sense to one another. Clustering analysis identifies clusters in the data.
A good clustering method produces high-quality clusters, in which the inter-cluster similarity is low and the intra-cluster similarity is high; in other words, members of a cluster are more like each other than they are like members of a different cluster.
Clustering is useful for exploring data. If there are many cases and no obvious natural groupings, clustering data mining algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.
Clustering models differ from supervised models in that the outcome of the process is not guided by a known result; that is, there is no target attribute. Clustering models focus on the intrinsic structure, relations, and interconnectedness of the data. Clustering models are built using optimization criteria that favor high intra-cluster and low inter-cluster similarity. The model can then be used to assign cluster identifiers to data points.
In Oracle Data Mining, a cluster is characterized by its centroid, attribute histograms, and the cluster's place in the model's hierarchical tree. A centroid represents the most typical case in a cluster. For numerical attributes, the centroid value is the mean; for categorical attributes, it is the mode.
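The mean/mode definition of a centroid can be sketched in a few lines of Python. This is an illustrative computation only, not Oracle Data Mining's implementation; the attribute names and the sample cases are hypothetical.

```python
from statistics import mean, mode

# Hypothetical cluster members: AGE is numerical, REGION is categorical.
cluster_members = [
    {"AGE": 28, "REGION": "West"},
    {"AGE": 35, "REGION": "West"},
    {"AGE": 31, "REGION": "East"},
]

# Centroid: mean of each numerical attribute, mode of each categorical one.
centroid = {
    "AGE": mean(m["AGE"] for m in cluster_members),
    "REGION": mode(m["REGION"] for m in cluster_members),
}
print(centroid)  # AGE is the average of 28, 35, 31; REGION is the most common value
```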
Oracle Data Mining performs hierarchical clustering using an enhanced version of the k-Means algorithm and O-Cluster, an Oracle-proprietary Orthogonal Partitioning Clustering algorithm.
The clusters discovered by these algorithms are used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the bounding boxes that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the clustering bounding box. The consequent encodes the cluster ID for the cluster described by the rule. For example, for a data set with two attributes: AGE and HEIGHT, the following rule represents most of the data assigned to cluster 10:
If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
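The rule above can be read as a simple membership test: the antecedent checks each attribute against the bounding-box ranges, and the consequent returns the cluster ID. A minimal sketch (the function name and test cases are illustrative, not part of the Oracle API):

```python
# Apply the bounding-box rule for cluster 10: the antecedent tests each
# attribute range; the consequent is the cluster ID.
def rule_cluster_10(case):
    if 25 <= case["AGE"] <= 40 and 5.0 <= case["HEIGHT"] <= 5.5:
        return 10   # consequent: the cluster ID
    return None     # case not covered by this rule

print(rule_cluster_10({"AGE": 30, "HEIGHT": 5.2}))  # inside the box -> 10
print(rule_cluster_10({"AGE": 50, "HEIGHT": 5.2}))  # outside the box -> None
```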
The clusters are also used to generate a Bayesian probability model, which is used during scoring for assigning data points to clusters.
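To illustrate probabilistic scoring, the following toy sketch assigns a case to the cluster with the highest posterior probability (prior times likelihood). The per-cluster means, standard deviations, priors, and the Gaussian likelihood are all hypothetical simplifications; Oracle Data Mining's actual Bayesian model is more elaborate.

```python
import math

# Hypothetical per-cluster summaries over one numerical attribute (AGE).
clusters = {
    10: {"mean": 32.0, "sd": 4.0, "prior": 0.6},
    20: {"mean": 55.0, "sd": 6.0, "prior": 0.4},
}

def gaussian(x, mu, sd):
    # Normal density, used here as a toy per-cluster likelihood.
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def assign(x):
    # Posterior is proportional to prior * likelihood; normalize and
    # return the most probable cluster along with all probabilities.
    scores = {cid: c["prior"] * gaussian(x, c["mean"], c["sd"])
              for cid, c in clusters.items()}
    total = sum(scores.values())
    probs = {cid: s / total for cid, s in scores.items()}
    return max(probs, key=probs.get), probs

cid, probs = assign(30.0)  # a 30-year-old case falls closest to cluster 10
```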
The main characteristics of the enhanced k-Means and O-Cluster algorithms are summarized in Table 7-1.
Table 7-1 Clustering Algorithms Compared

Feature                           Enhanced k-Means                    O-Cluster
--------------------------------  ----------------------------------  ------------------------------------------
Clustering methodology            Distance-based                      Grid-based
Number of cases                   Handles data sets of any size       More appropriate for data sets that have
                                                                      more than 500 cases. Handles large tables
                                                                      through active sampling
Number of attributes              More appropriate for data sets      More appropriate for data sets with a
                                  with a low number of attributes     high number of attributes
Number of clusters                User-specified                      Automatically determined
Hierarchical clustering           Yes                                 Yes
Probabilistic cluster assignment  Yes                                 Yes
Recommended data preparation      Normalization                       Equi-width binning after clipping
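The two recommended preparations in the last table row can be sketched as follows. The sample values, the clipping bounds, and the number of bins are illustrative choices, not Oracle Data Mining defaults.

```python
values = [12, 25, 31, 44, 58, 61, 73, 99, 250]  # illustrative data; 250 is an outlier

# Min-max normalization (suggested for k-Means): map values onto [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Clipping (before binning for O-Cluster): limit extreme values so that
# outliers do not stretch the bin widths. Bounds chosen for illustration.
clip_lo, clip_hi = 12, 99
clipped = [min(max(v, clip_lo), clip_hi) for v in values]

# Equi-width binning: split [clip_lo, clip_hi] into bins of equal width.
n_bins = 5
width = (clip_hi - clip_lo) / n_bins
bins = [min(int((v - clip_lo) // width), n_bins - 1) for v in clipped]
```

Without the clipping step, the single outlier (250) would absorb most of the value range, leaving nearly all cases in the lowest bin.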