« Home | Real-Time Scoring & Model Management 3 - Performance » | Oracle Life Sciences Meeting » | Improved Stock Research Through Analytics? » | Multi-touch Screens - The Next Input Device? » | Poll - How do you work with analytics? » | Competing on Analytics » | SPSS and Inforsense Support for Oracle Data Mining » | Blog Changes » | A Couple of Papers on Oracle Analytics Available o... » | Time Series Forecasting - Part 1 » 

Tuesday, February 21, 2006 

Real-Time Scoring & Model Management 2 - Implementation

This is Part 2 in a series on large-scale real-time scoring and model management - The full series is Part 1, Part 2, and Part 3.

System Architecture
Consider a call center application that allows cross-selling products to a customer based on the customer historical data and information from the current call. The type of offer is conditional on the product category the customer is interested in during the call and the region where the customer lives. For example, a customer in Florida would be more likely to buy beach products than one in New England. As the customer service representative interacts with the customer, the application returns new product recommendations as new information becomes available during the call.

Architecture for a large scale real-time scoring application


Figure 2: Architecture for a large-scale real-time scoring application.

A simplified architecture for such an application is illustrated in Figure 2. The system has four main components:
  1. Client application
  2. Filtering module
  3. Scoring module
  4. Learning module
The call center system initiates the interaction with the scoring system by sending the relevant information about the current call, which includes the customer ID. This information is used to filter down the available models at the server side to those that are meaningful to the current call. The customer ID is used to retrieve relevant historical data that is then combined with call information at the Scoring Module for scoring the models selected by the Filtering Module. The Scoring Module returns a ranked list of products to the call center application. This cycle can be repeated many times during a single call. At the end of the call, the call center application sends the information collected over the call to the server, including what offers were made and if they were accepted or not. This information is combined with historical information and used for training new models in the Learning Module. The Learning Module implements the following model management features discussed above:
  • Automated model build
  • Champion-challenger testing
  • Automated model deployment
  • Storage of model metadata
Implementation
Let's take a look how the Filtering and Scoring modules of the simplified architecture described above can be implemented using the technologies in the 10gR2 Oracle Database. To help illustrate the approach, I am providing implementation examples for each of the steps.

Filtering Module

The Filtering Module selects the set of models appropriated for a given call center call. Filtering is performed on a set of attributes associated with a model – business rules can also be used for defining the filtering criteria. Attributes can be associated with a model in two ways: 1) directly, in the model metadata table, and 2) indirectly, by joining the model metadata table with other tables. Model filtering is accomplished by querying the model metadata table and by applying the desired filtering criteria using a WHERE clause. The filtering criteria can refer to columns from other tables besides the model metadata table.

Consider that we have a model metadata table (modelmd_tab) with the following schema:
model_name       VARCHAR2(30),
model_type VARCHAR2(30),
prod_id NUMBER,
region VARCHAR2(2),
model_accuracy NUMBER
The Learning Module populates this table as models are created and deleted. The information in modelmd_tab identifies the model (name and type), the product and region the model was built for, and the model accuracy. Consider also that we have a product table (products) containing product information with the following schema:
prod_id            NUMBER,
prod_type VARCHAR2(30),
prod_description VARCHAR2(4000)
The following query selects cross-sell models for the northeast (NE) region:
SELECT A.model_name, A.prod_id
FROM modelmd_tab A
WHERE A.model_type = 'cross-sell' AND A.region = 'NE'
This approach to model filtering is very flexible. Other types of information can be used for filtering by joining the modelmd_tab table with other tables. For example, we could restrict the models returned by the previous query to those that belong to a given product category (e.g., credit-card) using the following query:
SELECT A.model_name, A.prod_id
FROM modelmd_tab A,
(SELECT prod_id
FROM products
WHERE prod_type = 'credit-card') B
WHERE A.model_type = 'cross-sell' AND
A.region = 'NE' AND
A.prod_id = B.prod_id
By taking advantage of Oracle Spatial and Oracle Text capabilities, this approach can easily accommodate filtering using spatial and textual information.

Scoring Module
This module scores the set of models selected by the Filtering Module using historical customer data and/or call information. In our example, it returns a ranked list of products. The implementation proposed here takes advantage of temporary session tables and model caching for performance.

A temporary session table is used for holding the scoring results for a given request from the client. Before scoring each new request the table is truncated.

For reasonably sized models, caching is achieved through shared cursors. In order to avoid reloading the model every time, this requires the same SQL string to be used for all queries that score a given model. For example, running the following scoring query multiple times would not reload the model tree-model after the first time:
SELECT PREDICTION(tree_model USING A.*)
FROM customers A
WHERE A.cust_id = '1001'
However, if we modify the above SQL string in any way the model would be re-loaded. By using binding variables in the scoring query, it is possible to the get the benefit of caching without sacrificing flexibility:
SELECT PREDICTION(tree_model USING A.*)
FROM customers A
WHERE A.cust_id = :1
Another benefit of model caching is a reduction in shared pool memory usage, enabling greater concurrency and larger multi-user loads.

In order to implement the Scoring Module, I first created a temporary session table to hold the results. This is done at the time the system is setup. For this example, the table was created as follow:
CREATE GLOBAL TEMPORARY TABLE scoretab(
prod_id NUMBER,
prob NUMBER) ON COMMIT PRESERVE ROWS;
For each score request, scoretab will store a row per each eligible model selected by the Filtering Module. Each row has the ID of the product associated with the model and the probability of the customer buying that product.

Next, I created a PL/SQL stored procedure that filters and then scores all the eligible models. For simplicity, I decided to bundle the filtering and scoring modules in a single PL/SQL procedure:
CREATE OR REPLACE PROCEDURE score_multimodel(
p_cid NUMBER,
p_model_table_name VARCHAR2,
p_score_table_name VARCHAR2,
p_prod_where_clause VARCHAR2,
p_mod_where_clause VARCHAR2) AS
TYPE Char_Tab IS TABLE OF VARCHAR2(30);
TYPE Num_Tab IS TABLE OF NUMBER;
TYPE ModelCursorType IS REF CURSOR;
v_model_tab Char_Tab;
v_prod_tab Num_Tab;
v_cursor_model ModelCursorType;
v_mod_where VARCHAR(4000) := p_mod_where_clause;
v_prod_where VARCHAR(4000) := p_prod_where_clause;
v_gender CHAR(1);
v_year NUMBER;
v_marital VARCHAR2(20);
v_credit NUMBER;
v_sql_stmt0 VARCHAR2(4000);
v_sql_stmt VARCHAR2(4000);
v_num_models NUMBER;
BEGIN
-- Clear result table
v_sql_stmt := 'TRUNCATE TABLE ' || p_score_table_name;
execute immediate v_sql_stmt;

-- Get customer information (NOTE: this assumes a fixed schema)
SELECT CUST_GENDER, CUST_YEAR_OF_BIRTH, CUST_MARITAL_STATUS,
CUST_CREDIT_LIMIT
INTO v_gender, v_year, v_marital, v_credit
FROM customers WHERE cust_id = p_cid;

-- Open cursor and bulk read to get eligible models
v_sql_stmt0 := 'SELECT prod_id FROM products ';

IF length(v_prod_where) > 0 THEN
v_sql_stmt0 := v_sql_stmt0 || ' WHERE ' || v_prod_where;
END IF;

v_sql_stmt := 'SELECT A.model_name, A.prod_id ' ||
'FROM ' || p_model_table_name || ' A,' ||
'(' || v_sql_stmt0 || ') B ' ||
'WHERE A.prod_id = B.prod_id ';

IF length(v_mod_where) > 0 THEN
v_sql_stmt := v_sql_stmt || ' AND ' || v_mod_where;
END IF;
OPEN v_cursor_model FOR v_sql_stmt;

LOOP
FETCH v_cursor_model BULK COLLECT INTO v_model_tab, v_prod_tab;
EXIT WHEN v_cursor_model%NOTFOUND;
END LOOP;
CLOSE v_cursor_model;

-- Score each model and persist results to temporary table
v_num_models := v_model_tab.count;
FOR i IN 1..v_num_models LOOP
v_sql_stmt :=
'INSERT INTO ' || p_score_table_name ||
' SELECT :1, prediction_probability('|| v_model_tab(i) ||
' USING :2 as CUST_GENDER, :3 as CUST_YEAR_OF_BIRTH, ' ||
' :4 as CUST_MARITAL_STATUS, :5 as CUST_CREDIT_LIMIT) '||
' FROM dual ' ||
' WHERE prediction(' || v_model_tab(i) ||
' USING :6 as CUST_GENDER, :7 as CUST_YEAR_OF_BIRTH, ' ||
' :8 as CUST_MARITAL_STATUS, :9 as CUST_CREDIT_LIMIT)=1';
EXECUTE IMMEDIATE v_sql_stmt USING v_prod_tab(i),
v_gender, v_year, v_marital, v_credit,
v_gender, v_year, v_marital, v_credit;
END LOOP;
END score_multimodel;
/
show errors;
COMMIT;
The score_multimodel procedure takes as inputs the customer ID, for retrieval of historical data, the model metadata name, the temporary session table name, for persisting results, a WHERE clause for product filtering, and a WHERE clause for model filtering. The procedure scores all models associated with the products filtered by the WHERE clauses. Models are scored one at a time and results persisted to the temporary session table. Scoring each model separately and using binding variables allow for model caching through shared cursors.

Consider a product with prod_id = 120, an example of the scoring query in the inner loop of the procedure is as follow:
SELECT 120, 
PREDICTION_PROBABILITY(model1 USING 'M' AS cust_gender,
1960 AS cust_year_of_birth,
'Married' AS cust_marital_status,
50000 AS cust_credit_limit)
FROM dual
WHERE PREDICTION(model1 USING 'M' AS cust_gender,
1960 AS cust_year_of_birth,
'Married' AS cust_marital_status,
50000 AS cust_credit_limit) = 1
This scoring query only returns rows for products that the customer is likely to buy. The outcomes a model can predict are a function of the data provided to the model during training. For this example, a model predicts one of two outcomes: 0, if the customer would not buy the product, and 1, if the customer would buy the product. The use of the PREDICTION operator in the WHERE clause removes, from the result set, products the customer is unlikely to buy.

In the above example procedure, the schema for the customer historical information is hard-coded. The customer information is stored in the customer table and the following columns are used for scoring a model: cust_gender, cust_year_of_birth, cust_marital_status, cust_credit_limit. Usually, this would not be a problem as many applications work against a fixed schema. Also, in this example, there is no call data passed to the procedure. This type of information can be easily added by either passing it directly as arguments to the procedure or through a different session table that is accessed inside the procedure.

A typical call to score_multimodel would be something like this:
BEGIN
score_multimodel(1001, 'modelmd_tab', 'scoretab',
'prod_type = ''credit-card''',
'model_type = ''cross-sell'' AND region = ''NE''');
END;
where 1001 is the customer ID, modelmd_tab is the name of the model metadata table, scoretab is the temporary session table, the product filtering condition prod_type = 'credit-card' restricts the recommendations to credit-card products, and the model filtering only allows scoring of cross-sell models for the NE region. A typical output for the procedure would be:
prod_id      prob
------- --------
100 0.6
130 0.4
90 0.8
The client application would, in general, rank the results by probability and take the top N probabilities. For example, for N=10 we would have:
SELECT A.*
FROM (SELECT prod_id, prob FROM scoretab ORDER BY prob DESC) A
WHERE rownum < 11
The next post in this series gives performance numbers for the system. The full series is Part 1, Part 2, and Part 3.

Readings: Business intelligence, Data mining, Oracle analytics

Labels: ,

About me

  • Marcos M. Campos: Development Manager for Oracle Data Mining Technologies. Previously Senior Scientist with Thinking Machines. Over the years I have been working on transforming databases into easy to use analytical servers.
  • My profile

Disclaimer

  • Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
  • This work is licensed under a Creative Commons license.
  • Creative Commons License

Email-Digest



Feeds

Search


Posts

All Posts

Category Cloud

Links

Locations of visitors to this page
Powered by Blogger
Get Firefox