Oracle Data Mining and Analytics: The SQL of Analytics 1

What do we get when we combine the power of SQL with advanced analytics? The answer: a revolution. The Oracle RDBMS has added, over recent releases, support for advanced analytics to the SQL vocabulary. Because analytics is now part of the vocabulary, it can be used to answer all types of questions. Below I show examples of questions that we can ask using this new vocabulary. At the end I also show how easy it is to create models and how to find out the name of the models available to a user.

This is Part 1 in a series on analytics and SQL. This post covers data mining. A future post will cover statistical functions.

The combination of simplified model creation (see below) with the flexibility of SQL is a powerful tool for developing intelligent applications. It moves analytics away from the domain of experts, capable of navigating complex methodologies, into the hands of database and application developers. Developers can now use their creativity to combine and leverage data mining results in a myriad of ways. This changes the dynamics of how data mining is used in applications and opens up the opportunity for an increase in the number of "intelligent" applications. That is why I believe this is revolutionary.

My goal with the following examples is to illustrate how easy it is to use data mining results to create interesting and powerful queries. These types of queries make it straightforward for developers to add data mining results to applications and reports. The examples progress from simple to more complex queries.

Simple Prediction Query
Question:

Select all customers who have a high propensity to attrite (> 80% chance)

SQL Query:

      SELECT A.cust_name, A.contact_info
      FROM   customers A
      WHERE  PREDICTION_PROBABILITY(tree_model, ‘attrite’ 
                                    USING A.*) > 0.8

Comments:
This question can be answered with a simple predictive SQL query. The PREDICTION_PROBABILITY SQL function computes the probability that a customer will attrite using a customer attrition model called tree_model. This probability is used in the WHERE clause to select customers with a high propensity to attrite (probability greater than 80%). The customer attrition model used in the query was built with the type of information present in the customers table. The USING clause allows the * wildcard. It takes all the columns in customers and utilizes only the ones required by model tree_model.

Incorporating Other Relational Information
Question:

Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90

SQL Query:

      SELECT A.cust_name, A.contact_info
      FROM   customers A
      WHERE  PREDICTION_PROBABILITY(tree_model,‘attrite’ 
                                    USING A.*) > 0.8
      AND    A.cust_value > 90

Comments:
It is also very easy to answer more complex questions by incorporating other relational information. In this example, customers are further filtered based on the information in the cust_value column.

Multi-domain, Multi-database Analysis
Question:

Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90 and have had a recent conversation with customer service regarding a Checking Plus account.

SQL Query:

     SELECT A.cust_name, A.contact_info
     FROM   customers A
     WHERE  PREDICTION_PROBABILITY(tree_model,‘attrite’ 
                                   USING A.*) > 0.8
     AND    A.cust_value > 90
     AND    A.cust_id IN 
                 (SELECT B.cust_id
                  FROM   call_center@HQ_DB B
                  WHERE  B.call_date BETWEEN ’01-Jan-2005’
                  AND    ’30-Jun-2005’
                  AND    CONTAINS(B.notes, ‘Checking Plus’, 1) > 0)

Comments:
This example builds upon the previous query and shows how the many capabilities of the RDBMS can be combined to answer complex questions. The query combines the previous results with call center information from the table call_center residing in a remote database. The remote database is accessed through a database link (HQ_DB). The query also uses Oracle Text CONTAINS operator to select customers that have mentioned 'Checking Plus' in their conversations with customer services within a six-months time window.

Gaining Further Insight
Question:

Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90 and have had a recent conversation with customer service regarding a Checking Plus account.

BTW, also tell me how you got this prediction.

SQL Query:

      SELECT A.cust_name, A.contact_info,
             PREDICTION_DETAILS(tree_model USING A.*)
      FROM   customers A
      WHERE  PREDICTION_PROBABILITY(tree_model,‘attrite’ 
                                    USING A.*) > 0.8
      AND    A.cust_value > 90
      AND    A.cust_id IN 
                  (SELECT B.cust_id
                   FROM   call_center B
                   WHERE  B.call_date BETWEEN ’01-Jan-2005’
                   AND    ’30-Jun-2005’
                   AND    CONTAINS(B.notes, ‘Checking Plus’, 1) > 0);

Comments:
This query adds an explanation, in the form of the rule used to make the decision, to the previous query. The function PREDICTION_DETAILS returns the node ID for the rule used in the decision. The actual rule can be obtained using the function DBMS_DATA_MINING.GET_MODEL_DETAILS_XML.

Descriptive and Predictive SQL - Combining Multiple Models
Question:

For each segment based on our customer segmentation model, provide the number of customers in that segment as well as an average score for customer value and an average probability to attrite. Return this aggregate information ordered based on the overall segments’ propensity to attrite.

SQL Query:

      SELECT count(*) as cnt,
             avg(PREDICTION_PROBABILITY(svmC_model, ‘attrite’
                                        USING *)) as avg_attrite,
             avg(cust_value_score)
      FROM   customers
      GROUP BY CLUSTER_ID(clus_model USING *)
      ORDER BY avg_attrite DESC;

Comments:
This example introduces another data mining SQL function (CLUSTER_ID). It shows how we can combine models of different types to answer a larger array of interesting questions.

Function Composition (e.g., Missing Value Imputation)
Question:

Select the 10 customers who are most likely to attrite based solely on: age, gender, annual_income, and zipcode.

In addition, since annual_income is often missing, perform missing value imputation for the annual_income attribute using all of the customer demographics.

SQL Query:

      SELECT *
      FROM (SELECT cust_name, cust_contact_info
            FROM   customers
            ORDER BY PREDICTION_PROBABILITY(
                            tree_model, 'attrite'
                            USING age, gender, zipcode,
                            NVL(annual_income,
                                PREDICTION(svmR USING *)
                               ) AS annual_income) DESC
            )
      WHERE rownum < 11;

Comments:
This example orders customers by their probability to attrite using an attrition model (tree_model). It also uses another model (svmR) for missing value imputation for the annual_income attribute.

Auxiliary/Dependent Data Structures
Data mining models can also be used in auxiliary/dependent data structures, for example:

Materialized views
Functional indexes

This allows pre-computation of results for faster response time. The dependencies between a model and the dependent structures are also maintained. For example, if we drop a model used in a materialized view, the view is invalidated.

A functional index using a data mining model (svmR) can be created as follows:

      CREATE INDEX anninc_idx on customers
      (PREDICTION (svmR USING *))

As new data are added to the customers table the model is scored and the results are stored in the index. The index can then be used to speed up queries on customers that require the model's prediction. For example, let's say the model predicts customer value, then the following query would order customers by their value and could be used to select the top customers (order-by elimination):

      SELECT *
      FROM  customers
      ORDER BY PREDICTION (svmR_model USING *);

The index can also help in cases where the prediction is used in the WHERE clause (index-driven filtering). For example, the following query returns customers with value above 150000:

      SELECT cust_name
      FROM customers
      WHERE PREDICTION (svmR_model USING *) > 150000;

Building Models
Oracle Data Mining has algorithms that provide powerful capabilities with very little work required from the user. It is possible to easily create models using these algorithms. As shown above, these models can then be used in arbitrarily complex queries, functional indexes, and materialized views. Two of these algorithms are: decision tree and support vector machine (SVM).

The decision tree algorithm can be used for generating profiles, predictions, and constructing hierarchies. It is a fast algorithm that provides easy to interpret rules (more information here and here). We can create a decision tree model as follows:

CREATE TABLE dt_settings (
setting_name  VARCHAR2(30),
setting_value VARCHAR2(30));

BEGIN
  -- Populate settings table
  INSERT INTO dt_sample_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'sales_type_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'sales_dataset',
    case_id_column_name => 'sales_id',
    target_column_name  => 'sales_type',
    settings_table_name => 'dt_settings');
END;

The SVM algorithm provides support for regression, time series, anomaly detection, and classification (more information here and here). We can create an SVM model for regression as follows:

BEGIN
  DBMSDATAMINING.CREATEMODEL(
    model_name          => 'sales_model',
    mining_function     => dbms_data_mining.regression,
    data_table_name     => 'sales_dataset',
    case_id_column_name => 'sales_id',
    target_column_name  => 'sales');
END;

In general, it is useful to normalize the data before building an SVM model. This can be done easily with one of the functions in the DBMS_DATA_MINING_TRANSFORM package or through a query. An example of the latter can be found here.

Finding Available Models
The list of models available for scoring can be obtained from the dm_user_models view. This view can be filtered to provide a list of model that match a given criterion. The following query lists the demo models available in the DM_USER schema.

SQL> SELECT 'NAME', 'FUNCTION_NAME', 'ALGORITHM_NAME' 
FROM DM_USER_MODELS;

NAME                  FUNCTION_NAME         ALGORITHM_NAME
--------------------------------------------------------------------
ABN_SH_CLAS_SAMPLE    CLASSIFICATION        ADAPTIVE_BAYES_NETWORK 
AI_SH_SAMPLE          ATTRIBUTE_IMPORTANCE  MINIMUM_DESCRIPTION_LENGTH
AR_SH_SAMPLE          ASSOCIATION_RULES     APRIORI_ASSOCIATION_RULES
DT_SH_CLAS_SAMPLE     CLASSIFICATION        DECISION_TREE            
KM_SH_CLUS_SAMPLE     CLUSTERING            KMEANS                   
NB_SH_CLAS_SAMPLE     CLASSIFICATION        NAIVE_BAYES              
NMF_SH_SAMPLE         FEATURE_EXTRACTION    NONNEGATIVE_MATRIX_FACTOR 
OC_SH_CLUS_SAMPLE     CLUSTERING            O_CLUSTER   
SVMC_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES   
SVMO_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES 
SVMR_SH_REGR_SAMPLE   REGRESSION            SUPPORT_VECTOR_MACHINES   
T_SVM_CLAS_SAMPLE     CLASSIFICATION        SUPPORT_VECTOR_MACHINES

Readings: Business intelligence, Data mining, Oracle analytics

This article looks good, but it assumes a certain level of understanding for data mining concepts. Can you recommend a link for data mining using Oracle "for dummies" or something like that? I guess I'm looking for somewhere to start looking at (using Oracle) and I'm not sure where to go.

Thanks and regards...Rich

Posted by Murnane | 6/06/2006 09:47:00 AM

It is true, the post starts off assuming that someone had already built models and can then use them in SQL queries. Later on I show how easy it is to create models. However, the bigger issue of when, why, and what type of models should one create in order to solve different problems is left unanswered. I am planning to write a series of posts discussing this. I see a great need for describing how many interesting tasks can be accomplished with analytics. I hope you may find that useful.

In the meantime, you may want to try the two tutorials pointed to in the ODM Tutorial #1 Download on KDnuggets post. The Oracle Data Mining Concepts is also a good starting point.

On the book front, the reading list on the sidebar of this blog has a link to a number of books on data mining. These are generic books and do not cover Oracle Data Mining. Many of the most popular books are included in that list. Currently, there is only one book out there on Oracle Data Mining. I have not had a chance to read it, so I cannot comment on it.

Posted by Marcos | 6/06/2006 09:00:00 PM

Thank you for the follow up Marcos.

Best regards...Rich

Posted by Murnane | 8/07/2006 09:08:00 AM

hi there ,

the 'attrite ', is it the target attribute ?

Posted by Anonymous | 9/19/2006 01:49:00 PM

'attrite" is one of the values the target attribute can take. The example assumes that the target can take on two values: ' attrite' and 'stay'.

Posted by Marcos | 9/19/2006 04:57:00 PM

I have never in my 15 years of working in data mining and analytics seen an attrition question such as the one you have posed answered using a simple query(Select all customers who have a high propensity to attrite (> 80% chance)). The question is a LOT more complex than that, and the answers should always involve a domain of experts. Not all answers to questions are SQL query oriented. What is valuable is the fact these mining tools are being moved closer to the warehouse repository, so as to avoid moving data back and forth from external packages such as SAS, or SPSS Clementine.
Thank you for the information you have supplied, it is a good starting point.

Ralph Winters

Posted by Ralph Winters | 5/11/2007 11:19:00 AM

Ralph,

Thanks for your comments. The first question was intended to be a simple example to showcase the basic functionality. The other examples build-up in complexity and illustrate that it is possible to build arbitrarily complex queries using this type of functionality. I agree that solution to specific problems need to reflect domain knowledge. This is true not only for data mining related scoring but to query and reporting in general as well.

Regarding your comment that not all questions are SQL query oriented, I think that it is a matter of context. In a BI settings (query and reporting) SQL rules and most (if not all) reporting tools speak SQL. Also when the goal is to integrate data mining in the operational system of a company two cases dominate: back-end processes and middle-tier process. The first usually involves a database and SQL is key. For the second case the answer is mixed. In some cases SQL plays no role and in others, when there is communication with a database, it has its place.

Your point about data movement is right on the mark. It has great implications for performance, specially during scoring. Keeping models and data in the database also simplifies moving models into production, improves security, and has a number of other benefits for the IT infrastructure. These are important considerations if we want to increase the impact of data mining. We need to think beyond the analyst and consider the whole picture (analysts and IT infrastructure).

Hope this help to clarify what I was going for with the examples.

Posted by Marcos | 5/12/2007 09:21:00 AM

Hi macros,

Your blog is really good....i found answers to lot of questions in pl/sql and data mining

look at my problem:

I want to override the default number of clusters (10) for the k-means clustering.

I have wrote

CREATE OR REPLACE procedure as cluster_build(table_name varchar2,case_id varchar2)
begin
execute immediate 'drop table cluster_settings ';
execute immediate 'CREATE TABLE cluster_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(30))';
insert into cluster_settings values (DBMS_DATA_MINING.clus_num_clusters,'4');
DBMS_DATA_MINING.DROP_MODEL('KM_SAMPLE');
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'km_sample',
mining_function => dbms_data_mining.clustering,
data_table_name => table_name,
case_id_column_name => case_id,
settings_table_name => 'cluster_settings');
end;

-----------------------------
but when I ran the procedure , I got 10 clusters...

I used the query..

SELECT setting_name, setting_value
FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('km_sample'))
ORDER BY setting_name

The output of the above query showed 10 in the CLUS_NUM_CLUSTERS row(setting_value).

how to overide that 10?...plz help me..i'm struct up in a project....

thanks,
swathi.

Posted by sv | 3/27/2008 06:15:00 AM

Swathi,

I replied to this question in the ODM forum as you have posted a similar question there. This way more people can benefit from the answer. Here is the link to the post. In a nutshell, I was able to get the correct behavior on a 11g database.

Posted by Marcos | 4/02/2008 06:16:00 AM

hi
most of the content here was available in the sql documentation..
ters no examples tat explains hw linear regression can be used in sql..pl help regarding the same...

-sam

Posted by Unknown | 3/24/2009 06:16:00 AM

Oracle Data Mining and Analytics

Blog on Data Mining and Analytics, with a special focus on Oracle. Analytics paves the way to transform databases into Knowledgebases.

Saturday, May 20, 2006

The SQL of Analytics 1 - Data Mining

About me

Disclaimer

Email-Digest

Feeds

Search

Series

Posts

Category Cloud

Links

Reading lists

Oracle Data Mining and Analytics

Blog on Data Mining and Analytics, with a special focus on Oracle. Analytics paves the way to transform databases into Knowledgebases.  <input name="u" value=http://oracledmt.blogspot.com/ type="hidden" />

Saturday, May 20, 2006

The SQL of Analytics 1 - Data Mining

var a = 10; if(a == 0) {document.write('No comment');} else if(a == 1) {document.write('1 comment');}else{document.write(a+' comments');}

About me

Disclaimer

Email-Digest

Feeds

Search

Series

Posts

Category Cloud

Links

Reading lists

Blog on Data Mining and Analytics, with a special focus on Oracle. Analytics paves the way to transform databases into Knowledgebases.