Oracle® Data Mining Concepts
11g Release 1 (11.1)

Part Number B28129-01

7 Clustering

This chapter describes clustering, the unsupervised mining function for discovering natural groupings within the data.

See Also:

"Unsupervised Data Mining"

This chapter includes the following topics: About Clustering and Clustering Algorithms.

About Clustering

A cluster is a collection of data objects that are similar in some sense to one another. Clustering analysis identifies clusters in the data.

A good clustering method produces high-quality clusters, in which inter-cluster similarity is low and intra-cluster similarity is high. In other words, members of a cluster are more like each other than they are like members of a different cluster.
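The quality criterion above can be illustrated with a minimal sketch. It assumes Euclidean distance as the (inverse) measure of similarity and uses two hypothetical groups of 2-D points; neither the data nor the helper name `mean_pairwise_distance` comes from Oracle Data Mining.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points_a, points_b=None):
    """Mean Euclidean distance within one group, or between two groups."""
    if points_b is None:
        pairs = list(combinations(points_a, 2))
    else:
        pairs = [(a, b) for a in points_a for b in points_b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Two hypothetical clusters of 2-D points (illustrative data only).
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

intra_a = mean_pairwise_distance(cluster_a)
intra_b = mean_pairwise_distance(cluster_b)
inter_ab = mean_pairwise_distance(cluster_a, cluster_b)

# A small distance corresponds to high similarity, so a good clustering
# has small intra-cluster distances and a large inter-cluster distance.
well_separated = max(intra_a, intra_b) < inter_ab
```

Here the members of each group lie close together and the two groups lie far apart, so `well_separated` evaluates to true.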

Clustering is useful for exploring data. If there are many cases and no obvious natural groupings, clustering data mining algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.

Clustering models differ from supervised models in that the outcome of the process is not guided by a known result; that is, there is no target attribute. Clustering models focus on the intrinsic structure, relations, and interconnectedness of the data. They are built using optimization criteria that favor high intra-cluster similarity and low inter-cluster similarity. The model can then be used to assign cluster identifiers to data points.

In Oracle Data Mining, a cluster is characterized by its centroid, attribute histograms, and the cluster's place in the model's hierarchical tree. A centroid represents the most typical case in a cluster. For numerical attributes, the centroid value is the mean. For categorical attributes, the centroid value is the mode.
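As a rough illustration of the centroid definition, the following sketch computes a mean for numerical values and a mode for categorical values. The `centroid` function, the `GENDER` attribute, and the sample cases are hypothetical and are not part of the Oracle Data Mining interfaces.

```python
from statistics import mean, mode


def centroid(cases, numeric_attrs, categorical_attrs):
    """Return the most typical case: mean for numeric, mode for categorical."""
    result = {}
    for attr in numeric_attrs:
        result[attr] = mean([case[attr] for case in cases])
    for attr in categorical_attrs:
        result[attr] = mode([case[attr] for case in cases])
    return result


# Hypothetical members of a single cluster (illustrative data only).
cluster_cases = [
    {"AGE": 30, "HEIGHT": 5.0, "GENDER": "F"},
    {"AGE": 34, "HEIGHT": 5.5, "GENDER": "F"},
    {"AGE": 38, "HEIGHT": 5.0, "GENDER": "M"},
]

typical_case = centroid(cluster_cases, ["AGE", "HEIGHT"], ["GENDER"])
```

The resulting `typical_case` need not match any actual member of the cluster; it summarizes the cluster attribute by attribute.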

Clustering Algorithms

Oracle Data Mining performs hierarchical clustering using an enhanced version of the k-means algorithm and the Orthogonal Partitioning Clustering algorithm (O-Cluster), an Oracle proprietary algorithm.
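For orientation, the following is a minimal sketch of the textbook k-means procedure, not of Oracle's enhanced, hierarchical version, whose internals are not described here. It assumes Euclidean distance, a naive initialization from the first k points, and a fixed number of iterations; the function name and data are illustrative only.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Textbook k-means: alternate assignment and centroid update steps."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Hypothetical 2-D data with two visible groups (illustrative only).
points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
centroids, assignments = k_means(points, k=2)
```

The number of clusters `k` is supplied by the caller, which corresponds to the user-specified number of clusters for a distance-based method.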

The clusters discovered by these algorithms are used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the bounding boxes that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the bounding box of the cluster. The consequent encodes the cluster ID for the cluster described by the rule. For example, for a data set with two attributes, AGE and HEIGHT, the following rule represents most of the data assigned to cluster 10:

If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
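One way to picture such a rule is as a set of inclusive ranges, one per attribute. The sketch below encodes the rule for cluster 10 in that form; the `ClusterRule` class and its `covers` method are hypothetical names chosen for illustration and do not correspond to an Oracle Data Mining API.

```python
from dataclasses import dataclass


@dataclass
class ClusterRule:
    cluster_id: int
    bounds: dict  # attribute name -> (low, high), both ends inclusive

    def covers(self, case):
        """True if every bounded attribute of the case lies inside the box."""
        return all(
            low <= case[attr] <= high
            for attr, (low, high) in self.bounds.items()
        )


# The example rule: AGE in [25, 40] and HEIGHT in [5.0, 5.5] feet.
rule_10 = ClusterRule(10, {"AGE": (25, 40), "HEIGHT": (5.0, 5.5)})

inside = rule_10.covers({"AGE": 31, "HEIGHT": 5.3})
outside = rule_10.covers({"AGE": 52, "HEIGHT": 5.3})
```

Because a rule represents most, not all, of the data assigned to a cluster, a case that falls outside the box is not necessarily excluded from that cluster.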

The clusters are also used to generate a Bayesian probability model, which is used during scoring for assigning data points to clusters.
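The form of the Bayesian model is not specified here. Purely as an illustration of probabilistic cluster assignment, the sketch below assumes a naive Bayes model in which each cluster has a prior probability and a per-attribute histogram over binned values. The function name, the `AGE_BIN` attribute, and all probabilities are hypothetical.

```python
def cluster_posteriors(case, priors, histograms):
    """Return P(cluster | case), assuming attributes are independent
    within each cluster (a naive Bayes assumption, not Oracle's stated model)."""
    scores = {}
    for cluster_id, prior in priors.items():
        likelihood = 1.0
        for attr, value in case.items():
            likelihood *= histograms[cluster_id][attr].get(value, 0.0)
        scores[cluster_id] = prior * likelihood
    total = sum(scores.values())
    if total == 0:
        return dict(priors)  # no evidence: fall back to the priors
    return {cid: score / total for cid, score in scores.items()}


# Hypothetical priors and attribute histograms for two clusters.
priors = {10: 0.6, 11: 0.4}
histograms = {
    10: {"AGE_BIN": {"25-40": 0.8, "41-60": 0.2}},
    11: {"AGE_BIN": {"25-40": 0.1, "41-60": 0.9}},
}

posteriors = cluster_posteriors({"AGE_BIN": "25-40"}, priors, histograms)
best_cluster = max(posteriors, key=posteriors.get)
```

Each data point receives a probability for every cluster rather than a single hard label, and the most probable cluster can be taken as the assignment.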

The main characteristics of the enhanced k-means and O-Cluster algorithms are summarized in Table 7-1.

Table 7-1 Clustering Algorithms Compared

| Feature | Enhanced k-Means | O-Cluster |
|---|---|---|
| Clustering methodology | Distance-based | Grid-based |
| Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases. Handles large tables through active sampling |
| Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes |
| Number of clusters | User-specified | Automatically determined |
| Hierarchical clustering | Yes | Yes |
| Probabilistic cluster assignment | Yes | Yes |
| Recommended data preparation | Normalization | Equi-width binning after clipping |
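
The two recommended data preparation steps can be sketched as follows. This is a minimal illustration that assumes min-max normalization and clipping to fixed bounds supplied by the caller; the choice of normalization method and of how the clipping bounds are derived is an assumption, and the function names and data are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def clip(values, low, high):
    """Pull outliers back to the nearest bound."""
    return [min(max(v, low), high) for v in values]


def equi_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so cap it at the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


ages = [22, 25, 31, 38, 40, 95]  # hypothetical data with one outlier

normalized = min_max_normalize(ages)                # preparation for k-means
binned = equi_width_bins(clip(ages, 20, 60), 4)     # preparation for O-Cluster
```

Clipping before binning keeps a single outlier, such as the value 95, from stretching the bin width so far that most of the cases fall into one bin.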