Oracle Data Mining and Analytics: Real-Time Scoring & Model Management 2

This is Part 2 in a series on large-scale real-time scoring and model management - The full series is Part 1, Part 2, and Part 3.

System Architecture
Consider a call center application that allows cross-selling products to a customer based on the customer historical data and information from the current call. The type of offer is conditional on the product category the customer is interested in during the call and the region where the customer lives. For example, a customer in Florida would be more likely to buy beach products than one in New England. As the customer service representative interacts with the customer, the application returns new product recommendations as new information becomes available during the call.

Figure 2: Architecture for a large-scale real-time scoring application.

A simplified architecture for such an application is illustrated in Figure 2. The system has four main components:

Client application
Filtering module
Scoring module
Learning module

The call center system initiates the interaction with the scoring system by sending the relevant information about the current call, which includes the customer ID. This information is used to filter down the available models at the server side to those that are meaningful to the current call. The customer ID is used to retrieve relevant historical data that is then combined with call information at the Scoring Module for scoring the models selected by the Filtering Module. The Scoring Module returns a ranked list of products to the call center application. This cycle can be repeated many times during a single call. At the end of the call, the call center application sends the information collected over the call to the server, including what offers were made and if they were accepted or not. This information is combined with historical information and used for training new models in the Learning Module. The Learning Module implements the following model management features discussed above:

Automated model build
Champion-challenger testing
Automated model deployment
Storage of model metadata

Implementation
Let's take a look how the Filtering and Scoring modules of the simplified architecture described above can be implemented using the technologies in the 10gR2 Oracle Database. To help illustrate the approach, I am providing implementation examples for each of the steps.

Filtering Module
The Filtering Module selects the set of models appropriated for a given call center call. Filtering is performed on a set of attributes associated with a model – business rules can also be used for defining the filtering criteria. Attributes can be associated with a model in two ways: 1) directly, in the model metadata table, and 2) indirectly, by joining the model metadata table with other tables. Model filtering is accomplished by querying the model metadata table and by applying the desired filtering criteria using a WHERE clause. The filtering criteria can refer to columns from other tables besides the model metadata table.

Consider that we have a model metadata table (modelmd_tab) with the following schema:

model_name       VARCHAR2(30),
model_type       VARCHAR2(30),
prod_id          NUMBER,
region           VARCHAR2(2),
model_accuracy   NUMBER

The Learning Module populates this table as models are created and deleted. The information in modelmd_tab identifies the model (name and type), the product and region the model was built for, and the model accuracy. Consider also that we have a product table (products) containing product information with the following schema:

prod_id            NUMBER,
prod_type          VARCHAR2(30),
prod_description   VARCHAR2(4000)

The following query selects cross-sell models for the northeast (NE) region:

SELECT A.model_name, A.prod_id
FROM   modelmd_tab A
WHERE  A.model_type = 'cross-sell' AND A.region = 'NE'

This approach to model filtering is very flexible. Other types of information can be used for filtering by joining the modelmd_tab table with other tables. For example, we could restrict the models returned by the previous query to those that belong to a given product category (e.g., credit-card) using the following query:

SELECT A.model_name, A.prod_id
FROM   modelmd_tab A,
       (SELECT prod_id
        FROM   products
        WHERE  prod_type = 'credit-card') B
WHERE  A.model_type = 'cross-sell' AND
       A.region = 'NE' AND
       A.prod_id = B.prod_id

By taking advantage of Oracle Spatial and Oracle Text capabilities, this approach can easily accommodate filtering using spatial and textual information.

Scoring Module
This module scores the set of models selected by the Filtering Module using historical customer data and/or call information. In our example, it returns a ranked list of products. The implementation proposed here takes advantage of temporary session tables and model caching for performance.

A temporary session table is used for holding the scoring results for a given request from the client. Before scoring each new request the table is truncated.

For reasonably sized models, caching is achieved through shared cursors. In order to avoid reloading the model every time, this requires the same SQL string to be used for all queries that score a given model. For example, running the following scoring query multiple times would not reload the model tree-model after the first time:

SELECT PREDICTION(tree_model USING A.*)
FROM   customers A
WHERE  A.cust_id = '1001'

However, if we modify the above SQL string in any way the model would be re-loaded. By using binding variables in the scoring query, it is possible to the get the benefit of caching without sacrificing flexibility:

SELECT PREDICTION(tree_model USING A.*)
FROM   customers A
WHERE  A.cust_id = :1

Another benefit of model caching is a reduction in shared pool memory usage, enabling greater concurrency and larger multi-user loads.

In order to implement the Scoring Module, I first created a temporary session table to hold the results. This is done at the time the system is setup. For this example, the table was created as follow:

CREATE GLOBAL TEMPORARY TABLE scoretab(
prod_id   NUMBER,
prob      NUMBER) ON COMMIT PRESERVE ROWS;

For each score request, scoretab will store a row per each eligible model selected by the Filtering Module. Each row has the ID of the product associated with the model and the probability of the customer buying that product.

Next, I created a PL/SQL stored procedure that filters and then scores all the eligible models. For simplicity, I decided to bundle the filtering and scoring modules in a single PL/SQL procedure:

CREATE OR REPLACE PROCEDURE score_multimodel(
  p_cid NUMBER,
  p_model_table_name  VARCHAR2,
  p_score_table_name  VARCHAR2,
  p_prod_where_clause VARCHAR2,
  p_mod_where_clause  VARCHAR2) AS
TYPE Char_Tab           IS TABLE OF VARCHAR2(30);
TYPE Num_Tab            IS TABLE OF NUMBER;
TYPE ModelCursorType    IS REF CURSOR;
v_model_tab             Char_Tab;
v_prod_tab              Num_Tab;
v_cursor_model          ModelCursorType;
v_mod_where             VARCHAR(4000) := p_mod_where_clause;
v_prod_where            VARCHAR(4000) := p_prod_where_clause;
v_gender                CHAR(1);
v_year                  NUMBER;
v_marital               VARCHAR2(20);
v_credit                NUMBER;
v_sql_stmt0             VARCHAR2(4000);
v_sql_stmt              VARCHAR2(4000);
v_num_models            NUMBER;
BEGIN
  -- Clear result table
  v_sql_stmt := 'TRUNCATE TABLE ' || p_score_table_name;
  execute immediate v_sql_stmt;

  -- Get customer information (NOTE: this assumes a fixed schema)
  SELECT CUST_GENDER, CUST_YEAR_OF_BIRTH, CUST_MARITAL_STATUS,
         CUST_CREDIT_LIMIT
  INTO v_gender, v_year, v_marital, v_credit
  FROM customers WHERE cust_id = p_cid;

  -- Open cursor and bulk read to get eligible models
  v_sql_stmt0 := 'SELECT prod_id FROM products ';

  IF length(v_prod_where) > 0 THEN
    v_sql_stmt0 := v_sql_stmt0  || ' WHERE ' || v_prod_where;
  END IF;

  v_sql_stmt := 'SELECT A.model_name, A.prod_id ' ||
                'FROM ' || p_model_table_name || ' A,' ||
                '(' || v_sql_stmt0 || ') B '  ||
                'WHERE A.prod_id = B.prod_id ';

  IF length(v_mod_where) > 0 THEN
    v_sql_stmt := v_sql_stmt  || ' AND ' || v_mod_where;
  END IF;
  OPEN v_cursor_model FOR v_sql_stmt;

  LOOP
    FETCH v_cursor_model BULK COLLECT INTO v_model_tab, v_prod_tab;
    EXIT WHEN v_cursor_model%NOTFOUND;
  END LOOP;
  CLOSE v_cursor_model;

  -- Score each model and persist results to temporary table
  v_num_models :=  v_model_tab.count;
  FOR i IN 1..v_num_models LOOP
    v_sql_stmt := 
        'INSERT INTO ' || p_score_table_name ||
        ' SELECT :1, prediction_probability('|| v_model_tab(i) ||
        '   USING :2 as CUST_GENDER, :3 as CUST_YEAR_OF_BIRTH, ' ||
        '   :4 as CUST_MARITAL_STATUS, :5 as CUST_CREDIT_LIMIT) '||
        ' FROM dual ' ||
        ' WHERE prediction(' || v_model_tab(i) ||
        '   USING :6 as CUST_GENDER, :7 as CUST_YEAR_OF_BIRTH, ' ||
        '   :8 as CUST_MARITAL_STATUS, :9 as CUST_CREDIT_LIMIT)=1';
    EXECUTE IMMEDIATE v_sql_stmt USING v_prod_tab(i), 
                      v_gender, v_year, v_marital, v_credit,
                      v_gender, v_year, v_marital, v_credit;
  END LOOP;
END score_multimodel;
/
show errors;
COMMIT;

The score_multimodel procedure takes as inputs the customer ID, for retrieval of historical data, the model metadata name, the temporary session table name, for persisting results, a WHERE clause for product filtering, and a WHERE clause for model filtering. The procedure scores all models associated with the products filtered by the WHERE clauses. Models are scored one at a time and results persisted to the temporary session table. Scoring each model separately and using binding variables allow for model caching through shared cursors.

Consider a product with prod_id = 120, an example of the scoring query in the inner loop of the procedure is as follow:

SELECT 120, 
       PREDICTION_PROBABILITY(model1 USING 'M' AS cust_gender, 
                                          1960 AS cust_year_of_birth,
                                     'Married' AS cust_marital_status, 
                                         50000 AS cust_credit_limit)
FROM  dual
WHERE PREDICTION(model1 USING 'M' AS cust_gender, 
                             1960 AS cust_year_of_birth,
                        'Married' AS cust_marital_status,                                         
                            50000 AS cust_credit_limit) = 1

This scoring query only returns rows for products that the customer is likely to buy. The outcomes a model can predict are a function of the data provided to the model during training. For this example, a model predicts one of two outcomes: 0, if the customer would not buy the product, and 1, if the customer would buy the product. The use of the PREDICTION operator in the WHERE clause removes, from the result set, products the customer is unlikely to buy.

In the above example procedure, the schema for the customer historical information is hard-coded. The customer information is stored in the customer table and the following columns are used for scoring a model: cust_gender, cust_year_of_birth, cust_marital_status, cust_credit_limit. Usually, this would not be a problem as many applications work against a fixed schema. Also, in this example, there is no call data passed to the procedure. This type of information can be easily added by either passing it directly as arguments to the procedure or through a different session table that is accessed inside the procedure.

A typical call to score_multimodel would be something like this:

BEGIN
   score_multimodel(1001, 'modelmd_tab', 'scoretab', 
                    'prod_type = ''credit-card''',
                    'model_type = ''cross-sell'' AND region = ''NE''');
END;

where 1001 is the customer ID, modelmd_tab is the name of the model metadata table, scoretab is the temporary session table, the product filtering condition prod_type = 'credit-card' restricts the recommendations to credit-card products, and the model filtering only allows scoring of cross-sell models for the NE region. A typical output for the procedure would be:

prod_id      prob
-------  --------
    100       0.6
    130       0.4
     90       0.8

The client application would, in general, rank the results by probability and take the top N probabilities. For example, for N=10 we would have:

SELECT A.*
FROM   (SELECT prod_id, prob FROM scoretab ORDER BY prob DESC) A
WHERE  rownum < 11

The next post in this series gives performance numbers for the system. The full series is Part 1, Part 2, and Part 3.

Readings: Business intelligence, Data mining, Oracle analytics

Labels: Model Management, Real-Time

Oracle Data Mining and Analytics

Blog on Data Mining and Analytics, with a special focus on Oracle. Analytics paves the way to transform databases into Knowledgebases.

Tuesday, February 21, 2006

Real-Time Scoring & Model Management 2 - Implementation

About me

Disclaimer

Email-Digest

Feeds

Search

Series

Posts

Category Cloud

Links

Reading lists