
Saturday, May 20, 2006 

The SQL of Analytics 1 - Data Mining

What do we get when we combine the power of SQL with advanced analytics? The answer: a revolution. Over recent releases, the Oracle RDBMS has added support for advanced analytics to the SQL vocabulary. Because analytics is now part of the vocabulary, it can be used to answer all types of questions. Below I show examples of questions that we can ask using this new vocabulary. At the end I also show how easy it is to create models and how to list the models available to a user.

This is Part 1 in a series on analytics and SQL. This post covers data mining. A future post will cover statistical functions.

The combination of simplified model creation (see below) with the flexibility of SQL is a powerful tool for developing intelligent applications. It moves analytics away from the domain of experts, capable of navigating complex methodologies, into the hands of database and application developers. Developers can now use their creativity to combine and leverage data mining results in a myriad of ways. This changes the dynamics of how data mining is used in applications and opens up the opportunity for an increase in the number of "intelligent" applications. That is why I believe this is revolutionary.

My goal with the following examples is to illustrate how easy it is to use data mining results to create interesting and powerful queries. These types of queries make it straightforward for developers to add data mining results to applications and reports. The examples progress from simple to more complex queries.

Simple Prediction Query

Select all customers who have a high propensity to attrite (> 80% chance)
SQL Query:
      SELECT A.cust_name, A.contact_info
      FROM customers A
      WHERE PREDICTION_PROBABILITY(tree_model, 'attrite' USING A.*) > 0.8;

This question can be answered with a simple predictive SQL query. The PREDICTION_PROBABILITY SQL function computes the probability that a customer will attrite using a customer attrition model called tree_model. This probability is used in the WHERE clause to select customers with a high propensity to attrite (probability greater than 80%). The customer attrition model used in the query was built with the type of information present in the customers table. The USING clause allows the * wildcard. It takes all the columns in customers and utilizes only the ones required by model tree_model.

Incorporating Other Relational Information
Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90
SQL Query:
      SELECT A.cust_name, A.contact_info
      FROM customers A
      WHERE PREDICTION_PROBABILITY(tree_model, 'attrite' USING A.*) > 0.8
      AND A.cust_value > 90;
It is also very easy to answer more complex questions by incorporating other relational information. In this example, customers are further filtered based on the information in the cust_value column.

Multi-domain, Multi-database Analysis
Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90 and have had a recent conversation with customer service regarding a Checking Plus account.
SQL Query:
      SELECT A.cust_name, A.contact_info
      FROM customers A
      WHERE PREDICTION_PROBABILITY(tree_model, 'attrite' USING A.*) > 0.8
      AND A.cust_value > 90
      AND A.cust_id IN
          (SELECT B.cust_id
           FROM call_center@HQ_DB B
           WHERE B.call_date BETWEEN '01-Jan-2005' AND '30-Jun-2005'
           AND CONTAINS(B.notes, 'Checking Plus', 1) > 0);

This example builds upon the previous query and shows how the many capabilities of the RDBMS can be combined to answer complex questions. The query combines the previous results with call center information from the table call_center residing in a remote database. The remote database is accessed through a database link (HQ_DB). The query also uses the Oracle Text CONTAINS operator to select customers who mentioned 'Checking Plus' in their conversations with customer service within a six-month time window.

Gaining Further Insight
Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90 and have had a recent conversation with customer service regarding a Checking Plus account.
BTW, also tell me how you got this prediction.
SQL Query:
      SELECT A.cust_name, A.contact_info,
             PREDICTION_DETAILS(tree_model USING A.*) AS pred_details
      FROM customers A
      WHERE PREDICTION_PROBABILITY(tree_model, 'attrite' USING A.*) > 0.8
      AND A.cust_value > 90
      AND A.cust_id IN
          (SELECT B.cust_id
           FROM call_center B
           WHERE B.call_date BETWEEN '01-Jan-2005' AND '30-Jun-2005'
           AND CONTAINS(B.notes, 'Checking Plus', 1) > 0);
This query adds an explanation, in the form of the rule used to make the decision, to the previous query. The function PREDICTION_DETAILS returns the node ID for the rule used in the decision. The actual rule can be obtained using the function DBMS_DATA_MINING.GET_MODEL_DETAILS_XML.
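For example, the full rule set behind tree_model can be retrieved as follows (a sketch: the function returns an XML document in a PMML-like format, and extracting the rule for a specific node ID requires XMLType operators whose exact XPath varies by release):

      -- Retrieve the rules of tree_model as an XML document
      SELECT DBMS_DATA_MINING.GET_MODEL_DETAILS_XML('tree_model') AS model_rules
      FROM dual;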

Descriptive and Predictive SQL - Combining Multiple Models
For each segment based on our customer segmentation model, provide the number of customers in that segment as well as an average score for customer value and an average probability to attrite. Return this aggregate information ordered based on the segments' overall propensity to attrite.
SQL Query:
      SELECT CLUSTER_ID(km_model USING *) AS segment,
             count(*) AS cnt,
             avg(cust_value) AS avg_cust_value,
             avg(PREDICTION_PROBABILITY(svmC_model, 'attrite'
                 USING *)) AS avg_attrite
      FROM customers
      GROUP BY CLUSTER_ID(km_model USING *)
      ORDER BY avg_attrite DESC;
This example introduces another data mining SQL function (CLUSTER_ID). It shows how we can combine models of different types to answer a larger array of interesting questions.

Function Composition (e.g., Missing Value Imputation)
Select the 10 customers who are most likely to attrite based solely on: age, gender, annual_income, and zipcode.

In addition, since annual_income is often missing, perform missing value imputation for the annual_income attribute using all of the customer demographics.
SQL Query:
      SELECT *
      FROM (SELECT cust_name, cust_contact_info
            FROM customers
            ORDER BY PREDICTION_PROBABILITY(tree_model, 'attrite'
                     USING age, gender, zipcode,
                           NVL(annual_income,
                               PREDICTION(svmR_model USING *)
                           ) AS annual_income) DESC)
      WHERE rownum < 11;
This example orders customers by their probability to attrite using an attrition model (tree_model). It also uses another model (svmR_model) for missing value imputation of the annual_income attribute.

Auxiliary/Dependent Data Structures
Data mining models can also be used in auxiliary/dependent data structures, for example:
  • Materialized views
  • Functional indexes
This allows pre-computation of results for faster response time. The dependencies between a model and the dependent structures are also maintained. For example, if we drop a model used in a materialized view, the view is invalidated.
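For example, a materialized view that pre-computes attrition probabilities could be created as follows (a sketch; the view name is illustrative and the refresh policy would depend on the application):

      CREATE MATERIALIZED VIEW cust_attrite_mv
      BUILD IMMEDIATE AS
      SELECT cust_id,
             PREDICTION_PROBABILITY(tree_model, 'attrite' USING *) AS attrite_prob
      FROM customers;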

A functional index using a data mining model (svmR_model) can be created as follows:
      CREATE INDEX anninc_idx ON customers
        (PREDICTION(svmR_model USING *));
As new data are added to the customers table, the model is scored and the results are stored in the index. The index can then be used to speed up queries on customers that require the model's prediction. For example, if the model predicts customer value, the following query would order customers by their value and could be used to select the top customers (order-by elimination):
      SELECT *
      FROM customers
      ORDER BY PREDICTION(svmR_model USING *) DESC;
The index can also help in cases where the prediction is used in the WHERE clause (index-driven filtering). For example, the following query returns customers with value above 150000:
      SELECT cust_name
FROM customers
WHERE PREDICTION (svmR_model USING *) > 150000;
Building Models
Oracle Data Mining has algorithms that provide powerful capabilities with very little work required from the user. It is possible to easily create models using these algorithms. As shown above, these models can then be used in arbitrarily complex queries, functional indexes, and materialized views. Two of these algorithms are: decision tree and support vector machine (SVM).

The decision tree algorithm can be used for generating profiles, making predictions, and constructing hierarchies. It is a fast algorithm that provides easy-to-interpret rules (more information here and here). We can create a decision tree model as follows:
      CREATE TABLE dt_settings (
        setting_name  VARCHAR2(30),
        setting_value VARCHAR2(30));

      -- Populate settings table
      INSERT INTO dt_settings VALUES
        (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);
      COMMIT;

      BEGIN
        DBMS_DATA_MINING.CREATE_MODEL(
          model_name          => 'sales_type_model',
          mining_function     => dbms_data_mining.classification,
          data_table_name     => 'sales_dataset',
          case_id_column_name => 'sales_id',
          target_column_name  => 'sales_type',
          settings_table_name => 'dt_settings');
      END;
      /
The SVM algorithm provides support for regression, time series, anomaly detection, and classification (more information here and here). We can create an SVM model for regression as follows:
      -- SVM is the default algorithm for regression,
      -- so no settings table is needed
      BEGIN
        DBMS_DATA_MINING.CREATE_MODEL(
          model_name          => 'sales_model',
          mining_function     => dbms_data_mining.regression,
          data_table_name     => 'sales_dataset',
          case_id_column_name => 'sales_id',
          target_column_name  => 'sales');
      END;
      /
In general, it is useful to normalize the data before building an SVM model. This can be done easily with one of the functions in the DBMS_DATA_MINING_TRANSFORM package or through a query. An example of the latter can be found here.
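As a sketch of the first approach (the procedures are from the DBMS_DATA_MINING_TRANSFORM package; the normalization table and view names here are illustrative):

      BEGIN
        -- Table to hold the linear normalization definitions
        DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN('sales_norm');
        -- Compute min-max normalization parameters,
        -- excluding the case id and target columns
        DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX(
          norm_table_name => 'sales_norm',
          data_table_name => 'sales_dataset',
          exclude_list    =>
            DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST('sales_id', 'sales'));
        -- Create a view with the normalized data for model building
        DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN(
          norm_table_name => 'sales_norm',
          data_table_name => 'sales_dataset',
          xform_view_name => 'sales_dataset_norm');
      END;
      /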

Finding Available Models

The list of models available for scoring can be obtained from the dm_user_models view. This view can be filtered to provide a list of models that match given criteria. The following query lists the demo models available in the DM_USER schema.
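A sketch of such a query (the LIKE pattern assumes the demo models follow the SAMPLE naming convention used by the ODM demo programs):

      SELECT name, function_name, algorithm_name
      FROM dm_user_models
      WHERE name LIKE '%SAMPLE%';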


Readings: Business intelligence, Data mining, Oracle analytics

This article looks good, but it assumes a certain level of understanding of data mining concepts. Can you recommend a link for data mining using Oracle "for dummies" or something like that? I guess I'm looking for somewhere to start (using Oracle) and I'm not sure where to go.

Thanks and regards...Rich

It is true, the post starts off assuming that someone had already built models and can then use them in SQL queries. Later on I show how easy it is to create models. However, the bigger issue of when, why, and what type of models should one create in order to solve different problems is left unanswered. I am planning to write a series of posts discussing this. I see a great need for describing how many interesting tasks can be accomplished with analytics. I hope you may find that useful.

In the meantime, you may want to try the two tutorials pointed to in the ODM Tutorial #1 Download on KDnuggets post. The Oracle Data Mining Concepts is also a good starting point.

On the book front, the reading list on the sidebar of this blog has a link to a number of books on data mining. These are generic books and do not cover Oracle Data Mining. Many of the most popular books are included in that list. Currently, there is only one book out there on Oracle Data Mining. I have not had a chance to read it, so I cannot comment on it.

Thank you for the follow up Marcos.

Best regards...Rich

hi there ,

the 'attrite', is it the target attribute?

'attrite' is one of the values the target attribute can take. The example assumes that the target can take on two values: 'attrite' and 'stay'.

I have never in my 15 years of working in data mining and analytics seen an attrition question such as the one you have posed answered using a simple query (Select all customers who have a high propensity to attrite (> 80% chance)). The question is a LOT more complex than that, and the answers should always involve domain experts. Not all answers to questions are SQL query oriented. What is valuable is the fact that these mining tools are being moved closer to the warehouse repository, so as to avoid moving data back and forth from external packages such as SAS or SPSS Clementine.
Thank you for the information you have supplied, it is a good starting point.

Ralph Winters


Thanks for your comments. The first question was intended to be a simple example to showcase the basic functionality. The other examples build up in complexity and illustrate that it is possible to build arbitrarily complex queries using this type of functionality. I agree that solutions to specific problems need to reflect domain knowledge. This is true not only for data mining scoring but for query and reporting in general as well.

Regarding your comment that not all questions are SQL query oriented, I think that it is a matter of context. In a BI setting (query and reporting), SQL rules, and most (if not all) reporting tools speak SQL. Also, when the goal is to integrate data mining into the operational systems of a company, two cases dominate: back-end processes and middle-tier processes. The first usually involves a database, and SQL is key. For the second case the answer is mixed: in some cases SQL plays no role; in others, when there is communication with a database, it has its place.

Your point about data movement is right on the mark. It has great implications for performance, especially during scoring. Keeping models and data in the database also simplifies moving models into production, improves security, and has a number of other benefits for the IT infrastructure. These are important considerations if we want to increase the impact of data mining. We need to think beyond the analyst and consider the whole picture (analysts and IT infrastructure).

Hope this helps clarify what I was going for with the examples.

Hi Marcos,

Your blog is really good... I found answers to a lot of questions on PL/SQL and data mining.

look at my problem:

I want to override the default number of clusters (10) for the k-means clustering.

I wrote:

CREATE OR REPLACE PROCEDURE cluster_build(table_name VARCHAR2, case_id VARCHAR2) AS
BEGIN
  execute immediate 'drop table cluster_settings';
  execute immediate 'CREATE TABLE cluster_settings (
    setting_name  VARCHAR2(30),
    setting_value VARCHAR2(30))';
  insert into cluster_settings values (DBMS_DATA_MINING.clus_num_clusters, '4');
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'km_sample',
    mining_function     => dbms_data_mining.clustering,
    data_table_name     => table_name,
    case_id_column_name => case_id,
    settings_table_name => 'cluster_settings');
END;

but when I ran the procedure , I got 10 clusters...

I used the query..

SELECT setting_name, setting_value
FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('km_sample'))
ORDER BY setting_name;

The output of the above query showed 10 in the CLUS_NUM_CLUSTERS row(setting_value).

How do I override that 10? Please help me... I'm stuck on a project...



I replied to this question in the ODM forum, as you have posted a similar question there. This way more people can benefit from the answer. Here is the link to the post. In a nutshell, I was able to get the correct behavior on an 11g database.

Most of the content here is available in the SQL documentation. There are no examples that explain how linear regression can be used in SQL. Please help regarding the same...



About me

  • Marcos M. Campos: Development Manager for Oracle Data Mining Technologies. Previously Senior Scientist with Thinking Machines. Over the years I have been working on transforming databases into easy to use analytical servers.
  • My profile


  • Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
  • This work is licensed under a Creative Commons license.




