1 Data Mining API Use Cases

This chapter provides several examples that show how you can perform complex data mining operations by using simple SQL queries and PL/SQL procedures. The examples show techniques for obtaining valuable information about your customers. This kind of information is needed to answer business questions, such as: How should we position a product? When should we introduce a new product? What pricing and sales promotion strategies should we use?

Note:

The base interfaces to Oracle Data Mining are PL/SQL packages and SQL functions. Most of the examples throughout this manual use PL/SQL and SQL code.

Oracle Data Mining also supports a standards-based Java API, which is layered on the SQL API. The Java API is described in "Data Mining Java API" and in Chapter 7.

Analyze Customer Demographics and Buying Patterns

The code fragments in this section show how you can apply data mining models to learn more about your customers.

These examples use models that were previously created. They exist in Oracle Database as database objects. For the sake of simplicity, we can assume that you created the models, that they exist within your schema, and that you are executing the SQL statements that apply them. You could have used the PL/SQL API or the Java API to create the models.

Segment Customer Base and Predict Attrition

The query in Example 1-1 divides customers into segments and predicts customer attrition.

The segments are created by a clustering model named clus_model. The model uses heuristics to divide the customer database into groups that have similar characteristics.The prediction is generated by a classification model called svmC_model.

The model predicts the attrition probability for each customer in each segment. The data is returned by segment, ordered by the overall attrition propensity in the segment.

Example 1-1 Predict Attrition for Customer Segments

SELECT count(*) as cnt, 
       AVG(PREDICTION_PROBABILITY(svmC_model,
               'attrite' USING *)) as avg_attrite,
       AVG(cust_value_score)
  FROM customers 
 GROUP BY CLUSTER_ID(clus_model USING *)
 ORDER BY avg_attrite DESC;

The sophisticated analytics within this seemingly simple query let you see your customer base within natural groupings based on similarities. The SQL data mining functions show you where your most loyal and least loyal customers are. This information can help you make decisions about how to sell your product. For example, if most of the customers in one segment have a high probability to attrite, you might take a closer look at the demographics of that segment when considering your marketing strategy.

The CLUSTER_ID function in Example 1-1 applies the mining model clus_model to the customers table. It returns the segment for each customer. The number of segments is determined by the clustering algorithm. Oracle Data Mining supports two clustering algorithms: enhanced k-Means and O-Cluster.

The PREDICTION_PROBABILITY function in Example 1-1 applies the mining model svmC_model and returns the probability to attrite for each customer. Oracle Data Mining supports a number of classification algorithms. In this case, the model name implies that Support Vector Machine was chosen as the classifier.

Predict Missing Incomes

The sample query in Example 1-2 returns the ten customers who are most likely to attrite, based on age, gender, annual_income, and zipcode.

In addition, since annual_income is often missing, the PREDICTION function is used to perform missing value imputation for the annual_income attribute. The PREDICTION function applies the regression model, svmR_model, to predict the most likely annual income. The NVL function replaces missing annual income values with the resulting prediction.

Example 1-2 Predict Customer Attrition and Transform Missing Income Values

SELECT * FROM (
  SELECT cust_name, cust_contact_info
    FROM customers
   ORDER BY 
     PREDICTION_PROBABILITY(tree_model, 'attrite'
       USING age, gender, zipcode,
         NVL(annual_income, 
             PREDICTION(svmR_model USING *)) 
           as annual_income) DESC)
WHERE rownum < 11;

Find Anomalies in the Customer Data

These examples use a one-class SVM model to discover atypical customers (outliers), find common demographic characteristics of the most typical customers, and compute the probability that a new or hypothetical customer will be a typical affinity card holder.

Find the Top 10 Outliers

Find the top 10 outliers -- customers that differ the most from the rest of the population. Depending on the application, such atypical customers can be removed from the data.

SELECT cust_id FROM (
 SELECT cust_id
 FROM svmo_sh_sample_prepared
  ORDER BY prediction_probability(SVMO_SH_Clas_sample, 0 using *) DESC, 1)
WHERE rownum < 11;

Compute the Probability of a New Customer Being a Typical Affinity Card Member

Compute the probability of a new or hypothetical customer being a typical affinity card holder. Normalization of the numerical attributes is performed on-the-fly.

WITH age_norm AS (
   SELECT shift, scale FROM SVMO_SH_sample_norm WHERE col = 'AGE'),
     yrs_residence_norm AS (
       SELECT shift, scale FROM svmo_sh_sample_norm WHERE col = 'YRS_RESIDENCE')
       SELECT prediction_probability(SVMO_SH_Clas_sample, 1 using 
                             (44 - a.shift)/a.scale AS age,
                             (6 - b.shift)/b.scale AS yrs_residence,
                             'Bach.' AS education,
                             'Married' AS cust_marital_status,
                             'Exec.' AS occupation,
                             'United States of America' AS country_name,
                             'M' AS cust_gender,
                             'L: 300,000 and above' AS cust_income_level,
                             '3' AS houshold_size
                             ) prob_typical
FROM age_norm a, yrs_residence_norm b;

Find the Demographics of a Typical Affinity Card Member

Find demographic characteristics of the typical affinity card members. These statistics will not be influenced by outliers and are likely to provide a more truthful picture of the population of interest than statistics computed on the entire group of affinity members.

SELECT a.cust_gender, round(avg(a.age)) age, 
       round(avg(a.yrs_residence)) yrs_residence,
       count(*) cnt
FROM mining_data_one_class_v a
WHERE PREDICTION(SVMO_SH_Clas_sample using *) = 1
GROUP BY a.cust_gender
ORDER BY a.cust_gender;

Evaluate the Success of a Marketing Campaign

This example uses a classification model to predict who will respond to a marketing campaign for DVDs and why.

First predict the responders.

This statement uses the PREDICTION and PREDICTION_DETAILS functions to apply the model campaign_model.

SELECT cust_name, 
   PREDICTION(campaign_model USING *) 
      AS responder,
   PREDICTION_DETAILS(campaign_model USING *) 
      AS reason
FROM customers;

Combine the predictions with relational data.

This statement combines the predicted responders with additional information from the sales table. In addition to predicting the responders, it shows how much each customer has spent for a period of three months before and after the start of the campaign.

SELECT cust_name, 
        PREDICTION(campaign_model USING *) AS responder,
        SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN
            purchase_amt ELSE 0 END) AS pre_purch,
        SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN
            purchase_amt ELSE 0 END) AS post_purch
 FROM customers, sales 
 WHERE sales.cust_id = customers.cust_id 
   AND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005
 GROUP BY cust_id, PREDICTION(campaign_model USING *);

Combine the predictions and relational data with multi-domain, multi-database data.

In addition to predicting responders, find out how much each customer has spent on DVDs for a period of three months before and after the start of the campaign.

SELECT cust_name, 
        PREDICTION(campaign_model USING *) as responder,
        SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN
            purchase_amt ELSE 0 END) AS pre_purch,
        SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN
            purchase_amt ELSE 0 END) AS post_purch
 FROM customers, sales, products@PRODDB 
 WHERE sales.cust_id = customers.cust_id 
   AND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005
   AND sales.prod_id = products.prod_id
   AND CONTAINS(prod_description, 'DVD') > 0 
 GROUP BY cust_id, PREDICTION(campaign_model USING *);

Evaluate the effectiveness and significance of the information you have obtained.

Compare the success rate of predicted responders and non-responders within different regions and across the company. Is the success statistically significant?

SELECT responder, cust_region, COUNT(*) AS cnt,
       SUM(post_purch – pre_purch) AS tot_increase,
       AVG(post_purch – pre_purch) AS avg_increase,
       STATS_T_TEST_PAIRED(pre_purch, post_purch) AS significance
FROM (
 SELECT cust_name, cust_region
        PREDICTION(campaign_model USING *) AS responder,
        SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN
            purchase_amt ELSE 0 END) AS pre_purch,
        SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN
            purchase_amt ELSE 0 END) AS post_purch
 FROM customers, sales, products@PRODDB 
 WHERE sales.cust_id = customers.cust_id 
   AND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005
   AND sales.prod_id = products.prod_id
   AND CONTAINS(prod_description, 'DVD') > 0 
 GROUP BY cust_id, PREDICTION(campaign_model USING *) )
GROUP BY ROLLUP responder, cust_region ORDER BY 4 DESC;

Use Predictive Analytics to Create a Customer Profile

Predictive analytics, implemented in the DBMS_PREDICTIVE_ANALYTICS package, support routines for making predictions, assessing attribute importance, and creating profiles. Example 1-3 shows how you could use predictive analytics to generate a customer profile.

Note:

With predictive analytics, you do not need to create a model. The routine dynamically creates and applies a model, which does not persist upon completion.

The PROFILE statement in Example 1-3 returns rules that suggest whether or not a customer is likely to use an affinity card. The rules are generated based on two predictors: customer gender and customer occupation. The rules are written as XML to a results table with these columns.

Name                        Type
 -------------------------    ------------
 PROFILE_ID                   NUMBER
 RECORD_COUNT                 NUMBER
 DESCRIPTION                  XMLTYPE

The rule identifier is stored in PROFILE_ID. The number of cases described by the rule is stored in RECORD_COUNT. The XML that describes the rule is stored in the DESCRIPTION column.

Note:

This example uses sample data based on the SH schema. This data is used with the Data Mining sample programs. For information on the sample programs, see Oracle Data Mining Administrator's Guide.

Example 1-3 Generate a Customer Profile

--create a source view
070307
CREATE VIEW cust_gend_occ_view AS
                 SELECT cust_gender, occupation, affinity_card
                 FROM mining_data_apply;

-- describe the source data
DESCRIBE cust_gend_occ_view
 Name                            Null?    Type
 ----------------------------------------------
 CUST_GENDER                     VARCHAR2(1)
 OCCUPATION                      VARCHAR2(21)
 AFFINITY_CARD                   NUMBER(10)

-- find the rules
BEGIN
    DBMS_PREDICTIVE_ANALYTICS.PROFILE(
         DATA_TABLE_NAME    => 'cust_gend_occ_view',
         TARGET_COLUMN_NAME => 'affinity_card',
         RESULT_TABLE_NAME  => 'profile_result');
END;
/

-- PROFILE has created 5 rules
SELECT profile_id from cust_gend_occ_profile_results;

PROFILE_ID
----------
         1
         2
         3
         4
         5

-- display the rules

<SimpleRule id="1" score="1" recordCount="275">
  <CompoundPredicate booleanOperator="and">
    <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn">
      <Array type="string">"Exec." "Prof." "Protec."
      </Array>
    </SimpleSetPredicate>
    <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn">
      <Array type="string">"M"</Array>
    </SimpleSetPredicate>
  </CompoundPredicate>
<ScoreDistribution value="1" recordCount="146"/>
<ScoreDistribution value="0" recordCount="129"/>
</SimpleRule>

<SimpleRule id="2" score="0" recordCount="124">
  <CompoundPredicate booleanOperator="and">
    <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn">
      <Array type="string">"Exec." "Prof." "Protec."
      </Array>
    </SimpleSetPredicate>
    <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn">
      <Array type="string">"F"
      </Array>
    </SimpleSetPredicate>
  </CompoundPredicate>
<ScoreDistribution value="0" recordCount="96"/>
<ScoreDistribution value="1" recordCount="28"/>
</SimpleRule>

<SimpleRule id="3" score="0" recordCount="397">
  <CompoundPredicate booleanOperator="and">
    <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn">
      <Array type="string">"M"
      </Array>
    </SimpleSetPredicate>
    <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn">
      <Array type="string">"Crafts" "Sales" "TechSup" "Transp."
      </Array>
    </SimpleSetPredicate>
  </CompoundPredicate>
<ScoreDistribution value="0" recordCount="289"/>
<ScoreDistribution value="1" recordCount="108"/>
</SimpleRule>

<SimpleRule id="4" score="0" recordCount="316">
  <CompoundPredicate booleanOperator="and">
    <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn">
      <Array type="string">"M"
      </Array>
    </SimpleSetPredicate>
    <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn">
      <Array type="string">
        "?" "Cleric." "Farming" "Handler" "House-s" "Machine" "Other"
      </Array>
    </SimpleSetPredicate>
  </CompoundPredicate>
<ScoreDistribution value="0" recordCount="277"/>
<ScoreDistribution value="1" recordCount="39"/>
</SimpleRule

<SimpleRule id="5" score="0" recordCount="388">
  <CompoundPredicate booleanOperator="and">
    <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn">
      <Array type="string">
        "?" "Cleric." "Crafts" "Farming" "Handler" "House-s" "Machine" 
        "Other" "Sales" "TechSup" "Transp."
      </Array>
    </SimpleSetPredicate>
    <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn">
      <Array type="string">"F"
      </Array>
    </SimpleSetPredicate>
  </CompoundPredicate>
<ScoreDistribution value="0" recordCount="363"/>
<ScoreDistribution value="1" recordCount="25"/>
</SimpleRule>