Oracle® Data Mining Concepts 11g Release 1 (11.1) Part Number B28129-01


This chapter describes Orthogonal Partitioning Clustering (OCluster), an Oracle-proprietary clustering algorithm.
See Also:
Chapter 7, "Clustering"
Reference:
Campos, M.M., Milenova, B.L., "OCluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, Copyright © 2002 Oracle Corporation.
The OCluster algorithm creates a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space. The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. A parameter called sensitivity defines a baseline density level. Only areas with peak density above this baseline level can be identified as clusters.
The k-Means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-Means tessellates it into n clusters (where n is specified by the user). OCluster separates areas of high density by placing cutting planes through areas of low density. OCluster needs multimodal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, OCluster does not partition it.
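The peak-and-valley requirement can be illustrated with a minimal Python sketch. It is not Oracle's implementation: the function name `find_valley_cut` and the depth heuristic are assumptions made for illustration, and the statistical validation that OCluster applies to a candidate cut is omitted. The sketch only shows that a bimodal histogram offers a cut position while uniform and monotonic histograms do not.

```python
def find_valley_cut(hist):
    """Return the index of the deepest bin that has a strictly higher
    bin on both sides, or None if no such valley exists.

    Hypothetical helper: a simplified stand-in for the valley search,
    without any statistical significance test.
    """
    best = None  # (depth, index) of the deepest valley found so far
    for i in range(1, len(hist) - 1):
        left_peak = max(hist[:i])
        right_peak = max(hist[i + 1:])
        # A valley needs a strictly denser bin on each side.
        if left_peak > hist[i] and right_peak > hist[i]:
            depth = min(left_peak, right_peak) - hist[i]
            if best is None or depth > best[0]:
                best = (depth, i)
    return None if best is None else best[1]


bimodal = [2, 9, 14, 6, 1, 5, 12, 8, 2]   # two peaks separated by a valley
uniform = [7, 7, 7, 7, 7, 7]              # no peaks or valleys
monotone = [1, 2, 4, 7, 11, 16]           # density only increases

print(find_valley_cut(bimodal))   # 4
print(find_valley_cut(uniform))   # None
print(find_valley_cut(monotone))  # None
```

In this sketch the cutting plane for the bimodal attribute would be placed at bin 4, the low-density region between the two peaks; the other two projections are left unpartitioned.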
The clusters discovered by OCluster are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numerical attributes and multinomial distributions for categorical attributes.
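As a rough illustration of this scoring step, the following Python sketch applies Bayes' rule to a two-cluster mixture. The cluster parameters, attribute names (`age`, `plan`), and the function `cluster_posteriors` are invented for the example and are not part of Oracle Data Mining; the sketch assumes each component is a product of independent normal densities for numerical attributes and multinomial probabilities for categorical attributes, as described above.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def cluster_posteriors(row, clusters):
    """Return P(cluster | row) for each cluster using Bayes' rule.

    Hypothetical data layout: each cluster is a dict with a prior,
    per-attribute (mean, std) pairs, and per-attribute category
    probabilities.
    """
    joint = []
    for c in clusters:
        p = c["prior"]
        # Independent normal densities for numerical attributes.
        for attr, (mean, std) in c["numeric"].items():
            p *= normal_pdf(row[attr], mean, std)
        # Multinomial probabilities for categorical attributes.
        for attr, probs in c["categorical"].items():
            p *= probs.get(row[attr], 0.0)
        joint.append(p)
    total = sum(joint)
    if total == 0.0:
        # No component explains the row; fall back to equal weights.
        return [1.0 / len(clusters)] * len(clusters)
    return [p / total for p in joint]


# Made-up mixture parameters for illustration only.
clusters = [
    {"prior": 0.6, "numeric": {"age": (30.0, 5.0)},
     "categorical": {"plan": {"basic": 0.8, "premium": 0.2}}},
    {"prior": 0.4, "numeric": {"age": (55.0, 8.0)},
     "categorical": {"plan": {"basic": 0.3, "premium": 0.7}}},
]

post = cluster_posteriors({"age": 33.0, "plan": "basic"}, clusters)
assigned = post.index(max(post))  # index of the most probable cluster
```

The row is assigned to the cluster with the highest posterior probability, which here is the first cluster because both the age and the plan value are far more likely under it.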
OCluster does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It reads another batch only if statistical tests indicate that clusters may still exist that it has not yet uncovered.
Because OCluster may stop the model build before it reads all of the data, it is highly recommended that the data be randomized.
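A small Python sketch shows why row order matters when only the first batch may be read. The `batches` helper and the two-group data set are assumptions for illustration; the batch size of 50000 is the default quoted above. With sorted data the first batch contains only one group, while after shuffling it contains both.

```python
import random


def batches(rows, batch_size=50000):
    """Yield consecutive slices of rows (hypothetical batch reader)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Two groups stored in sorted order, as data often is in a table.
rows = ["low"] * 60000 + ["high"] * 60000

first_sorted = next(batches(rows))

shuffled = rows[:]
random.Random(0).shuffle(shuffled)  # fixed seed for a repeatable example
first_shuffled = next(batches(shuffled))

print(set(first_sorted))          # {'low'}
print(len(set(first_shuffled)))   # 2
```

If the build stopped after the first batch of the sorted data, the "high" group would never be seen.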
The use of Oracle Data Mining's equi-width binning transformation with automated estimation of the required number of bins is highly recommended.
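For readers unfamiliar with the technique, equi-width binning divides the range of an attribute into intervals of equal width. The Python sketch below is a generic illustration, not Oracle's transformation: the function `equi_width_bin` is hypothetical, and the number of bins is passed in explicitly because the automated estimation procedure is not described here.

```python
def equi_width_bin(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width
    intervals over [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant attribute: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land at index n_bins, so cap it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equi_width_bin([0.0, 2.5, 5.0, 7.5, 10.0], 4))  # [0, 1, 2, 3, 3]
```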
Binary attributes should be declared as categorical.
The presence of outliers can significantly impact either type of clustering model (k-Means or OCluster). Use a clipping transformation before you bin or normalize the data to avoid the problems caused by outliers.
Outliers with equi-width binning can prevent OCluster from detecting clusters. As a result, the whole population appears to fall within a single cluster.
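The effect can be reproduced with a short Python sketch. The helpers `clip` and `equi_width_bin`, the sample values, and the clipping bounds of 0 and 10 are all assumptions chosen for illustration; they do not reflect how Oracle's clipping transformation selects its bounds.

```python
def clip(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def equi_width_bin(values, n_bins):
    """Equal-width bin index (0..n_bins-1) for each value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Two natural groups (1 to 3 and 7 to 9) plus one extreme outlier.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 1000.0]

raw_bins = equi_width_bin(values, 5)
clipped_bins = equi_width_bin(clip(values, 0.0, 10.0), 5)

print(raw_bins)      # every non-outlier value shares one bin
print(clipped_bins)  # the two groups occupy different bins
```

Without clipping, the outlier stretches the bin width so far that both groups collapse into the first bin and the histogram shows no valley to cut through. After clipping, the groups fall into separate bins again.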
Categorical data is mapped to numerical values.