Thursday, February 18, 2010 

Funny YouTube Video Featuring Oracle Data Mining

Maybe I am too much of a data mining geek, but I found the video below to be funny. It also talks about a super cool feature ODM introduced in 11.2: the ability to score data mining models at the disk controller level in Exadata. This is a significant performance booster. It also makes it feasible to produce actionable insights from massive amounts of data extremely fast. More on this in a future post. For now, enjoy the video.


Oracle Data Mining Races with America's Cup

For those that have not heard, the BMW Oracle Racing team won the America's Cup sailing an incredible new boat. What even those who have been following the news on the race may not know is that Oracle Data Mining helped the performance team tune the boat.

I helped with that problem, and it was a very hard one:
Imagine standing under an avalanche of data - 2500 variables, 10 times per second and a sailing team demanding answers to design and sailing variations immediately. This was the challenge facing the BMW ORACLE Racing Performance Analysis Team every sailing day as they refined and improved their giant 90 foot wide, 115 foot long trimaran sporting the largest hard-sail wing ever made. Using ORACLE DATA MINING accessing an ORACLE DATABASE and presenting results real time using ORACLE APPLICATION EXPRESS the performance team managed to provide the information required to optimise the giant multihull to the point that it not only beat the reigning America's Cup champions Alinghi in their giant Catamaran but resoundingly crushed them in a power display of high speed sailing. After two races - and two massive winning margins - the America's Cup was heading back to America - a triumph for the team, ORACLE and American technology.
— Ian Burns, Performance Director, BMW ORACLE Racing Team
Here is a video showcasing the boat and giving some interesting information about it:


For more information check out the BMW Oracle Racing site here.


Tuesday, March 24, 2009 

Data Mining Survey - Last Call

Rexer Analytics has just issued a last call for its annual data mining survey. This is a pretty nice survey that provides a great deal of valuable information about how data mining is used and who is doing it.

To participate, please click on the link below and enter the access code in the space provided. The survey should take approximately 20 minutes to complete.  At the end of the survey you can request to have the results sent to you as well as get a copy of last year's survey.

Survey Link: www.RexerAnalytics.com/Data-Miner-Survey.html
Access Code: RS2008


Monday, November 03, 2008 

Job Opportunities

The Oracle Data Mining group has two openings for talented software developers and machine learning experts. Take a look at these links (IRC1133401, IRC1143559) for more details.


Tuesday, October 28, 2008 

Oracle BIWA Summit 2008


The Oracle BIWA Summit 2008 is approaching (December 2-3). It will be held at Oracle World HQ in Redwood Shores, California. This is the second event of its kind. Last year's event was a great success and lots of fun (see details here). This year's keynote speakers include Jeanne Harris (co-author of "Competing on Analytics") and Usama Fayyad (legendary data miner). Here is some information, and some links, about the event:

Sign up now to attend the Oracle BIWA Summit 2008, Dec. 2-3. Attend this unique two-day IOUG Business Intelligence, Warehousing and Analytics (BIWA) SIG (www.oraclebiwa.org) event to gain the knowledge and information critical for success in your work. Attend 65 technical talks and 10 hands-on sessions; hear keynotes from Jeanne Harris, co-author of the best-seller Competing on Analytics, and other industry leaders; learn the latest trends in data warehousing, business intelligence and analytics best practices; learn how to overcome common challenges; and network with your peers.

Learn how to improve data warehouse query performance by a factor of 10x with Oracle Exadata and hear firsthand from Oracle Senior Executives and other experts about the revolutionary new HP Oracle Exadata Storage Server and HP Oracle Database Machine and how they fit into Oracle’s data warehousing strategy.

  • Keynotes
  • Technical Talks and Hands-on Sessions (also see the BIWA Summit agenda)
  • Travel and Logistics (also see the BIWA Summit invitation)
  • Sponsors
  • REGISTER for BIWA now! Last year's BIWA Summit 2007 was sold out, so sign up now to reserve your space.

BIWA Summit 2008
December 2 - 3, 2008

Monday, September 08, 2008 

Collective Intelligence 1: Building an RSS Feed Archive

For a long time I have thought that we needed data mining books written for developers. Most data mining books are written for business or data analysts. Given that, it was a pleasant surprise to read Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. The book provides a good discussion on data mining concepts anchored with interesting examples. It also provides guidance on when to use the techniques. I still think that there is room for improvement on how data mining and analytics should be presented to developers. However, the book is a great step forward in enabling developers to use analytics.

Toby's book has many Web 2.0 examples illustrated with data obtained from different web sites (blogs, Facebook, Zebo, etc.). The examples are coded in Python or rely on freely available software for the data gathering and analytics. This got me thinking: if one wanted to illustrate how to create scalable and robust applications exploiting collective intelligence, would it not be interesting to replicate some of those examples utilizing the analytical and XML processing capabilities of the Oracle database? So I decided to do exactly that. This is the first post in a series showing how we can tackle the problems discussed in the book using technology in the Oracle database. Although most of what is described in the book can be done with the Oracle database, this series will only showcase some of the examples; to implement all of them would be like writing a new book.

Before we can start mining we need to collect data and store it in a database. There are many Web 2.0 websites with interesting data for mining. XML is the de facto data format returned by these websites. This post covers the basic steps on how to get this data by showing how to build an archive of entries from a list of RSS feeds. In later posts I will describe the mining.

First the good news: the Oracle RDBMS has a great set of features supporting XML and HTTP processing. Now the bad news: before writing this post I knew nothing about those features. So, after some digging around the web, I found lots of good information in a couple of articles. I have to say that I am really impressed by what can be done with the functionality in the database.

Below I describe some preparatory steps, how to query multiple RSS feeds at once, and finally how to automate the whole process to build an RSS feed archive.

Preparatory Steps
Lucas Jellema has a nice series of articles describing how to build an RSS feed reader and archive (1, 2). I used them as a starting point. The comments in the first article pointed me in the right direction to move from a PL/SQL implementation, as described in the article, to the SQL version given below.

Here is a list of things that one needs to do before running the code in Lucas' articles or the code below:

  • Configure fine-grained access using ACLs (access control lists) - this is needed in 11g
  • Take care of proxy servers - needed only if you sit behind a proxy server
  • Configure the database character set
Configuring Fine-Grained Access
Oracle provides access to external network services through several PL/SQL APIs (UTL_TCP, UTL_SMTP, UTL_MAIL, UTL_HTTP and UTL_INADDR). Before Oracle 11g, access to these external services was based on whether a user was granted execute permission on a specific package. Oracle 11g introduced fine-grained access to network services using access control lists (ACLs). ACLs allow you to control which users can access which network resources regardless of package grants.

Basically, we need to do the following to take advantage of the features in these packages:
  • Create an access control list (ACL)
  • Assign the ACL to the network
CONN sys/manager AS SYSDBA
GRANT EXECUTE ON UTL_HTTP TO dmuser;
BEGIN
  DBMS_NETWORK_ACL_ADMIN.create_acl (
    acl          => 'mining_the_web.xml', 
    description  => 'ACL functionality for mining the web',
    principal    => 'DMUSER',
    is_grant     => TRUE, 
    privilege    => 'connect',
    start_date   => SYSTIMESTAMP,
    end_date     => NULL);

  COMMIT;
END;
/
BEGIN
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl         => 'mining_the_web.xml',
    host        => '*', 
    lower_port  => 80,
    upper_port  => NULL); 

  COMMIT;
END;
/
The assign_acl call above is very generous. It allows the DMUSER user to connect to any server (*) through port 80. For more details take a look at this nice article by Tim Hall.
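If you prefer not to open port 80 to every host, the same assign_acl call can instead be issued once per feed host. A minimal sketch, using one of the hosts from the rss_feeds table defined below:
BEGIN
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl         => 'mining_the_web.xml',
    host        => 'rss.cnn.com',
    lower_port  => 80,
    upper_port  => 80);

  COMMIT;
END;
/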

Handling Proxies
If your machine sits behind a firewall you need to account for proxies by executing the following command in your session:
EXEC utl_http.set_proxy('proxy.my-company.com', 'corp.my-company.com');
For more details see the documentation for the UTL_HTTP package (link).

Configuring Database Character Set
When retrieving XML from RSS feeds it is a good idea to use a database with a Unicode character set, for example AL32UTF8. This prevents getting an error when trying to persist the XML content to a table.
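A quick way to check which character set your database uses:
SELECT value
FROM   nls_database_parameters
WHERE  parameter = 'NLS_CHARACTERSET';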

Querying Multiple RSS Feeds
First we create a table rss_feeds to hold the RSS feeds we want to track.
DROP TABLE rss_feeds PURGE;
CREATE TABLE rss_feeds
(title    varchar2(200), 
 url      varchar2(500), 
 category varchar2(50));
Next we insert a few RSS feeds into the table, assigning a category to each feed, such as: news, technology, etc.
INSERT INTO rss_feeds VALUES ('iTunes Top 25 Songs', 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wpa/MRSS/topsongs/limit=25/rss.xml','music');
INSERT INTO rss_feeds VALUES ('Apple Hot News', 'http://images.apple.com/main/rss/hotnews/hotnews.rss','Apple');
INSERT INTO rss_feeds VALUES ('Oracle Press Releases', 'http://www.oracle.com/rss/rss_ocom_pr.xml','Oracle');
INSERT INTO rss_feeds VALUES ('OTN Articles', 'http://www.oracle.com/technology/syndication/rss_otn_articles.xml','Oracle');
INSERT INTO rss_feeds VALUES ('CNN', 'http://rss.cnn.com/rss/cnn_topstories.rss','news');
INSERT INTO rss_feeds VALUES ('BBC', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml','news');
INSERT INTO rss_feeds VALUES ('NYT', 'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml','news');
INSERT INTO rss_feeds VALUES ('WPost Technology', 'http://www.washingtonpost.com/wp-dyn/rss/technology/index.xml','technology');
INSERT INTO rss_feeds VALUES ('Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot','technology');
Now we can use the SQL statement below to query the RSS feeds and map the XML documents retrieved to a relational schema:
SELECT t.source, p.scategory, it.title, it.link, it.pubdate, 
       it.author,
       XMLCAST(
         XMLQUERY('let $b := count(//category)
                   for $i at $pos in //category
                     return ($i, if ($pos = $b) then () else ", ")'
                  PASSING it.category RETURNING CONTENT
                 ) AS VARCHAR2(4000)
               ) category,
       it.description
FROM (SELECT HTTPURITYPE(url).getXML() doc, category scategory 
      FROM rss_feeds) p,
     XMLTABLE('/rss/channel' PASSING p.doc
              COLUMNS source VARCHAR2(4000) PATH 'title',
                      item XMLTYPE          PATH 'item') t,
     XMLTABLE('item' PASSING t.item
              COLUMNS title       VARCHAR2(4000) PATH 'title',
                      link        VARCHAR2(4000) PATH 'link',
                      pubDate     VARCHAR2(4000) PATH 'pubDate',
                      author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                      category    XMLTYPE        PATH 'category',
                      description CLOB           PATH 'description') it;
The above query has a number of interesting features. It uses the HTTPURITYPE function to retrieve the XML for each RSS feed in the rss_feeds table. The XMLTABLE statements then process the XML document returned from each feed to extract the desired pieces of information and convert the document into a relational source. The first XMLTABLE statement extracts the title and the item elements from each XML document returned by p. Because each XML document contains multiple items, a second XMLTABLE statement is used to return the pieces of each item element extracted by the first XMLTABLE. This second XMLTABLE statement also returns the category as XMLTYPE. This is needed because some sites have multiple category elements for a given item. The category elements are then extracted using the XMLQUERY function so that all of them are concatenated into a single comma-separated string. For alternative ways of doing the last bit take a look at this post by Mark Volkmann.

One problem with the above query is that if one of the RSS feed sites is down or unreachable the whole query fails. Given the nature of the web, this is not an unlikely event if we are trying to collect feeds from many sites. To overcome this problem, we can implement a PL/SQL function that wraps the HTTPURITYPE function and validates the URL before invoking it. If the site for a URL does not respond, the function returns NULL. The code in the sections below follows this approach.

Building the RSS Feed Archive
To build the RSS feed archive we want to periodically run the above query and persist to a table only the entries that have not been stored already.

First let's create a table for holding the RSS feed entries:
DROP TABLE rss_history PURGE;
CREATE TABLE rss_history
(source      varchar2(4000),
 scategory   varchar2(200),
 title       varchar2(4000),
 link        varchar2(4000),
 pubDate     varchar2(200),
 author      varchar2(4000),
 category    varchar2(4000),
 description CLOB);
Next we define a function that validates an RSS feed's URL and returns the XML for the feed.
CREATE OR REPLACE FUNCTION rss_getXML(url IN varchar2) 
                           RETURN XMLType
IS
  req   UTL_HTTP.REQ;
  resp  UTL_HTTP.RESP;
  res   XMLType := null;
BEGIN
  req  := UTL_HTTP.BEGIN_REQUEST(url);
  resp := UTL_HTTP.GET_RESPONSE(req);  
  IF (resp.status_code >=400 AND
      resp.status_code <= 599) 
  THEN
    res := null;
  ELSE
    res := httpuritype(url).getXML();  
  END IF;
  UTL_HTTP.END_RESPONSE(resp);  
  RETURN res;

  EXCEPTION
    WHEN OTHERS THEN
      UTL_HTTP.END_RESPONSE(resp);
      RETURN res;
END;
/
We can now use the following MERGE statement to merge only newer entries into the archive:
MERGE 
INTO rss_history rh 
USING (SELECT t.source, p.scategory, it.title, it.link, it.pubdate, 
              it.author, 
              XMLCAST(
                XMLQUERY('let $b := count(//category)
                          for $i at $pos in //category 
                            return ($i, if ($pos = $b) then () else ", ")'
                         PASSING it.category RETURNING CONTENT
                        ) AS VARCHAR2(4000) 
                     ) category,
              it.description
       FROM (SELECT rss_getXML(url) doc, category scategory 
             FROM rss_feeds) p,
            XMLTABLE('/rss/channel' PASSING p.doc
                     COLUMNS source VARCHAR2(4000) PATH 'title',
                             item XMLTYPE PATH 'item') t,
            XMLTABLE('item' PASSING t.item
                     COLUMNS title       VARCHAR2(4000) PATH 'title',
                             link        VARCHAR2(4000) PATH 'link',
                             pubDate     VARCHAR2(4000) PATH 'pubDate',
                             author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                             category    XMLTYPE        PATH 'category',
                             description CLOB           PATH 'description') it
      ) rl
ON (rh.link = rl.link)
WHEN NOT MATCHED 
THEN INSERT (title, source, scategory, link, pubDate, 
             author, category, description)
     VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate, 
             rl.author, rl.category, rl.description);
It is very impressive that a single SQL statement can retrieve entries from multiple RSS feeds, parse the XML, and merge the newer entries into a table for archival. 

Finally, let's set up a DBMS_SCHEDULER job to run this MERGE statement every fifteen minutes. First we create a PL/SQL procedure wrapping the above MERGE statement:
CREATE OR REPLACE PROCEDURE update_rss_history
 (p_proxy            IN VARCHAR2 DEFAULT NULL,
  p_no_proxy_domains IN VARCHAR2 DEFAULT NULL) 
IS
BEGIN
  -- When going through a firewall, pass requests through p_proxy host.
  -- Specify sites (p_no_proxy_domains) inside the firewall that don't 
  -- need the proxy host.
  IF (p_proxy is not null) AND
     (p_no_proxy_domains is not null)
  THEN
    UTL_HTTP.SET_PROXY(p_proxy, p_no_proxy_domains);
  END IF;  
  
  MERGE 
  INTO rss_history rh 
  USING ( SELECT t.source, p.scategory, it.title, it.link, it.pubdate, 
                 it.author, 
                 XMLCAST(
                   XMLQUERY('let $b := count(//category)
                             for $i at $pos in //category 
                               return ($i, if ($pos = $b) then () 
                                           else ", ")'
                            PASSING it.category RETURNING CONTENT
                           ) AS VARCHAR2(4000) 
                        ) category,
                 it.description
          FROM (SELECT rss_getXML(url) doc, 
                       category scategory 
                FROM rss_feeds) p,
               XMLTABLE('/rss/channel' PASSING p.doc
                        COLUMNS source VARCHAR2(4000) PATH 'title',
                                item XMLTYPE PATH 'item') t,
               XMLTABLE('item' PASSING t.item
                        COLUMNS title       VARCHAR2(4000) PATH 'title',
                                link        VARCHAR2(4000) PATH 'link',
                                pubDate     VARCHAR2(4000) PATH 'pubDate',
                                author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                                category    XMLTYPE        PATH 'category',
                                description CLOB           PATH 'description') it
        ) rl
  ON (rh.link = rl.link)
  WHEN NOT MATCHED 
  THEN INSERT (title, source, scategory, link, pubDate, 
               author, category, description)
       VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate, 
               rl.author, rl.category, rl.description);  
END;     
/
Besides the MERGE statement, the above procedure also includes some code for handling proxy servers as explained above.

Next we create the job that will execute this procedure every fifteen minutes:
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
   job_name        =>  'UPDATE_RSS_ARCHIVE',
   job_type        =>  'PLSQL_BLOCK',
   job_action      =>  'BEGIN 
                          update_rss_history();
                          COMMIT;
                        END;',
   enabled         =>  TRUE,
   start_date      =>  SYSDATE,
   repeat_interval => 'FREQ=MINUTELY;INTERVAL=15');
END;
/
For more details on DBMS_SCHEDULER see the Oracle documentation (link). 

To check on the status of the job we can use the following query:
SELECT job_name, enabled, 
       TO_CHAR(last_start_date,'YY-MM-DD HH24:MI:SS') last_run,
       TO_CHAR(last_run_duration,'HH:MI:SS') 
FROM user_scheduler_jobs;
To manually stop and/or drop the job we can use the following calls:
-- Stop job
EXECUTE DBMS_SCHEDULER.stop_job (job_name => 'update_rss_archive');

-- Drop job
EXECUTE DBMS_SCHEDULER.drop_job (job_name => 'update_rss_archive');
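To force an immediate run of the job, independent of its schedule, we can also call DBMS_SCHEDULER.run_job:
-- Run job on demand
EXECUTE DBMS_SCHEDULER.run_job (job_name => 'update_rss_archive');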
After running for a few days, the archive has accumulated quite a few entries:
SELECT COUNT(*) FROM rss_history;

  COUNT(*)
----------
      1558
In later posts I will show how to mine the archive.


Friday, August 22, 2008 

Data Mining in Action: Oracle Sales Prospector

I firmly believe that a major trend in applications is the incorporation of analytics-enabled functionality. Users want more than just reports or a replay of the past. Users want insights and want their attention directed to key points. This is where analytics can make a big impact across all types of applications. Notice that I am not proposing exposing analytical capabilities (e.g., data mining and statistical modeling) directly to users. That, I think, is only effective for a small number of users. What I believe is more meaningful is to present users with functionality and information that are the result of analytics taking place behind the scenes.

In line with this trend, Oracle has recently released a new application: Oracle Sales Prospector. This application is targeted at sales representatives. It suggests which accounts they should target with specific products or services. It also indicates which deals are most likely to close within specific time frames, and provides accompanying corporate profiles as well as likely customer references.

This is a very cool application and I have wanted to talk about it for some time now. My group worked closely with the Social CRM group in developing the data mining driving the product, and now that it has launched I can discuss some of the data mining details.

Oracle Sales Prospector is as analytics-driven as an application can be. Almost every aspect of what you see in its very nice user interface is the product of data mining models. Oracle Data Mining (ODM) provides the algorithms and the real-time scoring driving the application.


Oracle Sales Prospector main screen


The product functionality is presented to the user in an easy-to-use screen (see main screen above). Sales opportunities are presented in a bubble chart on the left of the main screen. Each bubble is an opportunity. An opportunity is a (customer, product) pair. The size of the bubble reflects the expected (estimated) revenue for the opportunity, the vertical placement how likely it is to close that opportunity, and the horizontal placement the expected number of months to close it.


Opportunity bubble chart

If the user provides no filtering criteria, the application scores all customers and products associated with the sales representative and displays the top ones (e.g., ten) ranked by expected revenue in the bubble chart. This is done in real time and scales very nicely to large numbers of products and customers. It is one of the benefits of having data mining tightly integrated with a robust and scalable database server like Oracle's.
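To give a flavor of what real-time scoring with ODM looks like in SQL, here is a minimal sketch. The model names (purchase_model, revenue_model), the candidate_opportunities table, and the target value 1 are all hypothetical; this is not the actual query used by Sales Prospector:
SELECT *
FROM  (SELECT t.customer_id, t.product_id,
              PREDICTION_PROBABILITY(purchase_model, 1 USING t.*) prob_to_close,
              PREDICTION(revenue_model USING t.*)                 expected_revenue
       FROM   candidate_opportunities t
       ORDER  BY expected_revenue DESC)
WHERE  ROWNUM <= 10;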


Filtering interface (upper left side of the application main screen)

The user can filter opportunities by product or customer. It is also possible to constrain the opportunities to those with a projected revenue above a given value and/or a probability of sale above a given threshold. Once the user specifies the filtering criteria, the application scores all customers and products in real time and returns those that match the criteria. The real-time aspect ensures that the latest pieces of information (e.g., current sales data) are taken into account when assessing opportunities.

The opportunity bubble chart provides a lot of functionality and leverages different types of modeling techniques. It uses, in different ways, regression, clustering, and association techniques to provide scalable, accurate, and transparent recommendations (customer-product pairs) as well as estimates of time to close and expected revenue. Robustness and scalability are important principles to keep in mind when designing analytical applications. It is better to sacrifice some accuracy and select more robust and scalable techniques that behave well with a variety of data distributions and data quality conditions.


Reference pane (lower right side of the main screen)

For each opportunity (bubble) the application also returns a list of likely references. These are customers that are like the one in the opportunity but that have already purchased the product in the opportunity. Customer segmentation using clustering is used for this part of the application. Similarity between customers is computed taking into account both purchasing behavior and company information (e.g., size and industry).
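As a rough sketch of the segmentation idea (with hypothetical model, table, and column names; this is not the actual Sales Prospector implementation), one could build a clustering model with DBMS_DATA_MINING and then use the CLUSTER_ID SQL function to find reference customers in the same segment:
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CUST_SEGMENTS',
    mining_function     => DBMS_DATA_MINING.CLUSTERING,
    data_table_name     => 'CUSTOMER_PROFILES',
    case_id_column_name => 'CUSTOMER_ID',
    settings_table_name => 'KMEANS_SETTINGS');  -- e.g., ALGO_NAME = ALGO_KMEANS
END;
/

-- Customers in the same segment as customer 101 that already own product 'P42'
SELECT c.customer_id
FROM   customer_profiles c
WHERE  CLUSTER_ID(cust_segments USING c.*) =
       (SELECT CLUSTER_ID(cust_segments USING p.*)
        FROM   customer_profiles p
        WHERE  p.customer_id = 101)
AND    c.customer_id IN (SELECT customer_id FROM orders
                         WHERE  product_id = 'P42');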

For more information on Oracle Sales Prospector check out its web site here and the Social CRM blog here.


Monday, August 18, 2008 

Matrix Inversion Using PL/SQL

Recently someone asked me how to invert a matrix using the UTL_NLA PL/SQL package. This can be done by solving a system of linear equations AX = B like the ones I described in this post but setting B to the identity matrix. I thought that the question was interesting enough to deserve its own post. So I am replicating the answer here for those interested.

The UTL_NLA package has many different procedures for solving systems of linear equations. Most of them are designed to take advantage of special form matrices (e.g., triangular, banded, etc.). For the example below, I picked the procedure LAPACK_GELS. This procedure solves general systems of linear equations in a robust fashion. It uses QR or LQ decompositions as needed.

DECLARE
  A     utl_nla_array_dbl :=  utl_nla_array_dbl(
               5,   153,  352, 
               153, 5899, 9697,
               352, 9697, 26086);
  B     utl_nla_array_dbl := utl_nla_array_dbl( 
               1, 0, 0,
               0, 1, 0,               
               0, 0, 1);
  ipiv  utl_nla_array_int := utl_nla_array_int(0,0);
  info  integer;
  n     NUMBER := 3;
  i     NUMBER;
  j     NUMBER;
BEGIN

  UTL_NLA.LAPACK_GELS (
    trans  => 'N',  -- solve for A instead of A'
    m      => n,    -- A number of rows 
    n      => n,    -- A number of columns
    nrhs   => n,    -- B number of columns
    a      => A,    -- matrix A
    lda    => n,    -- max(1, m)
    b      => B,    -- matrix B
    ldb    => n,    -- ldb >= max(1, m, n)
    info   => info, -- operation status (0=success)
    pack   => 'R'   -- how the matrices are stored
                    -- (R=row-wise)
  );

  dbms_output.put_line('inv(A)');
  dbms_output.put_line('-------------');
  dbms_output.put_line('info= '||info);
  IF info=0 THEN
    FOR l IN 1..B.count LOOP
      i := ceil(l/n);
      j := l-(i-1)*n;
      dbms_output.put_line('x[' || i ||']['||j||']= ' || 
                           TO_CHAR(B(l),'99.9999'));
    END LOOP;
  END IF;
END;
/
Here is the output for the above code:
inv(A)
-------------
info= 0
x[1][1]=  27.5307
x[1][2]=   -.2658
x[1][3]=   -.2727
x[2][1]=   -.2658
x[2][2]=    .0030
x[2][3]=    .0025
x[3][1]=   -.2727
x[3][2]=    .0025
x[3][3]=    .0028
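As a quick sanity check, one can multiply A by the computed inverse and confirm that the result is close to the identity matrix (only approximately here, since the printed values above are rounded). A minimal sketch using UTL_NLA.BLAS_GEMM, which follows the standard BLAS dgemm argument order:
DECLARE
  A    utl_nla_array_dbl := utl_nla_array_dbl(
               5,   153,  352,
               153, 5899, 9697,
               352, 9697, 26086);
  AINV utl_nla_array_dbl := utl_nla_array_dbl(
               27.5307, -.2658, -.2727,
               -.2658,   .0030,  .0025,
               -.2727,   .0025,  .0028);
  C    utl_nla_array_dbl := utl_nla_array_dbl(0,0,0, 0,0,0, 0,0,0);
BEGIN
  -- C := 1.0 * A * AINV + 0.0 * C
  UTL_NLA.BLAS_GEMM(
    transa => 'N', transb => 'N',
    m => 3, n => 3, k => 3,
    alpha => 1.0,
    a => A,    lda => 3,
    b => AINV, ldb => 3,
    beta => 0.0,
    c => C,    ldc => 3,
    pack => 'R');

  -- Print C; it should be approximately the 3x3 identity matrix
  FOR i IN 1..C.count LOOP
    dbms_output.put_line('c(' || i || ')= ' || TO_CHAR(C(i),'90.9999'));
  END LOOP;
END;
/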


Friday, August 15, 2008 

Oracle at KDD 2008 and KDD 2007 Pictures

It is that time of the year again. In about a week I am going to be attending the KDD (Knowledge Discovery in Databases) 2008 conference (conference website) along with some other Oracle colleagues. KDD is one of the primary conferences on data mining. This year it will take place in Las Vegas, Nevada, from August 24 to 27.

Oracle is a Gold sponsor for the event and will have again a large presence at the conference. Among other things, Oracle is the sponsor of the first poster reception and demonstration session on Monday.

This year I will not try to post daily conference highlights. After managing to do a poorer job last year than in the year before I gave up on the idea. Instead I will settle for a single post after the conference.

Here are some pictures from the Oracle booth at last year's KDD:


If you plan to attend KDD this year, stop by the Oracle booth to say hello. Last year I spent a great deal of time at the Oracle booth. There were lots of interesting questions and problems. Here are some of the topics we talked about:

  • How to combine data mining with a rules engine?
  • How to schedule a background process for anomaly detection in tables?
  • How to use database triggers to score new rows or modification of rows?
Because Oracle has a large array of features in the RDBMS, it is possible to come up with very interesting solutions to these and other questions by combining them. As an illustration of the last question, a sketch of a scoring trigger is given below.
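This is a minimal sketch of the trigger idea, assuming a hypothetical CUSTOMERS table (with AGE, INCOME, CHURN_PRED, and CHURN_PROB columns) and a hypothetical classification model named CHURN_MODEL:
CREATE OR REPLACE TRIGGER customers_score_trg
BEFORE INSERT OR UPDATE ON customers
FOR EACH ROW
BEGIN
  -- Score the incoming row and store the prediction with it
  SELECT PREDICTION(churn_model USING :NEW.age AS age,
                                      :NEW.income AS income),
         PREDICTION_PROBABILITY(churn_model USING :NEW.age AS age,
                                                  :NEW.income AS income)
  INTO   :NEW.churn_pred, :NEW.churn_prob
  FROM   dual;
END;
/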


Thursday, December 06, 2007 

Recap Post

For the past couple of months the blog took a back seat. Basically, since KDD, I have had very little time to write. I have been on the road quite a bit and my trip to KDD unleashed a number of research ideas that I have been following up. I will post on the latter over time as the results mature.

I have also dropped the ball on answering many of the emails and comments I have received. I have caught up with the blog email now. Usually I try to reply very quickly, and I apologize to those who have had their questions unanswered for a very long time. Below are some of the things that have been going on in the past couple of months. I will blog about them in more detail very soon.

  • KDD summary
  • Talks at ACM KDD-SIG and Oracle's Server Technology Seminar
  • Talks at the BIWA Summit
  • Talk at University of Rhode Island
  • Oracle Data Mining 11g
  • Analytics in Oracle Open World
  • SAS in-database analytics announcement
  • Data Mining and OLAP integration


Sunday, August 12, 2007 

KDD 2007

For the next couple of days I am going to be attending the KDD (Knowledge Discovery in Databases) 2007 conference (conference website) along with some other Oracle colleagues. KDD is one of the primary conferences on data mining. This year it will take place in San Jose, CA, from August 12 to 15.

Oracle is a Gold sponsor for the event and will have a large presence at the conference. Among other things, Oracle is the sponsor of the first poster reception and demonstration session on Monday.

I will try to post daily conference highlights and do a better job than last year when I only covered the first two days. If you plan to attend KDD this year, stop by the Oracle booth.


Monday, July 09, 2007 

On the Road and Upcoming Talks

This week I am going to be in San Francisco. I have been invited to give a talk at the San Francisco Bay ACM Data Mining SIG on Wednesday. The title of the talk is In-Database Analytics: A Disruptive Technology. Here is a link with information on the talk.

On Friday morning, I am presenting at the ST Seminar at Oracle's headquarters. The title of that talk is In-Database Mining: The I in BI. If you are in the area for the external event on Wednesday or at headquarters on Friday, stop by.

I will be posting links to both presentations later on.


Monday, June 04, 2007 

Way Cooler: PCA and Visualization - Linear Algebra in the Oracle Database 2

This post shows how to implement Principal Components Analysis (PCA) with the UTL_NLA package. It covers some of the uses of PCA for data reduction and visualization with a series of examples. It also provides details on how to build attribute maps and chromaticity diagrams, two powerful visualization techniques.

This is the second post in a series on how to do Linear Algebra in the Oracle Database using the UTL_NLA package. The first illustrated how to solve a system of linear equations. As promised in part 1, I am posting the code for a couple of procedures for reading and writing data from tables to arrays and vice-versa. Links to the data and code used in the examples are at the end of the post.

Principal Components Analysis
In data sets with many attributes groups of attributes often move together. This is usually due to the fact that more than one attribute may be measuring the same underlying force controlling the behavior of the system. In most systems there are only a few such driving forces. In these cases, it should be possible to eliminate the redundant information by creating a new set of attributes that extract the essential characteristics of the system. In effect we would like to replace a group of attributes with a single new one. PCA is a method that does exactly this. It creates a new set of attributes, the principal components, where each one of them is a linear combination of the original attributes. It is basically a special transformation of the data. It transforms numerical data to a new coordinate system where there is no redundant information recorded in the attributes.

Figure 1: A two-dimensional data distribution.

It is easier to understand what PCA does by looking at a couple of pictures (see Technical Note 1 below for a more technical explanation). Let's assume that we have a table (data set) with two columns (X1 and X2) and that all the rows (observations) in this table fall within the gray ellipse in Figure 1. The double-edged arrows in the picture indicate the ranges for the X1 and X2 attributes. A couple of things to notice in the picture:
  • The data dispersion for X1 and X2 is about the same
  • The distribution of the data falls along a diagonal pattern
The first point indicates that information from both attributes is required to distinguish the observations. The second point indicates that X1 and X2 are correlated. That is, when X1 is large X2 is also usually large, and when X1 is small X2 is also usually small.

What PCA does is to compute a new set of axes that can describe the data more efficiently. The idea is to rotate the axes so that the new axes (also called the principal components, i.e., PCs for short) are such that the variance of the data on each axis goes down from axis to axis. The first new axis is called the first principal component (PC1) and it is in the direction of the greatest variance in the data. Each new axis is constructed orthogonal to the previous ones and along the direction with the largest remaining variance. As a result the axes are labeled in order of decreasing variance. That is, if we label PC1 the axis for the first principal component then the variance for PC1 is larger than that for PC2 and so on.

Figure 2: PC1 and PC2 axes and their ranges.

Figure 2 shows the new axes PC1 and PC2 created by PCA for the example in Figure 1. As indicated by the sizes of the double-edged arrows, there is more variability (greater variance) on the PC1 axis than on the PC2 axis. The PC1 variance is also bigger than that of the original attributes (X1 and X2). This is not surprising, as PC1 is constructed to lie along the direction of the greatest variance in the data. The fact that PC2 has less variability means that it is less important for differentiating between observations. If there were more PCs, the data projections on these axes would have less and less variability. That is, they would capture less and less important aspects of the data. This is what allows compressing the data: by ignoring higher-order PCs we can create a new version of the data with fewer attributes than the original.

Figure 3: Coordinates for a point on the original axes and on the PC axes.

Figure 3 shows the coordinates for a point in the original coordinate system (attributes X1 and X2) and in the PC coordinate system (PC1 and PC2). We can see that, for the PC coordinate system, the majority of points can be distinguished from each other by just looking at the value of PC1. This is not true for the original coordinate system. For example, there are many fewer points with coordinate b1 than with coordinate a1.

The following are the steps commonly taken when working with PCA:
  • Center and/or scale the data. If an attribute spans several orders of magnitude, first LOG transform the attribute
  • Compute PCs
  • Look at the variance explained (inspection of the Scree plot)
  • Select the number of PCs to keep
The selected reduced set of PCs can then be used for different types of activities depending on the task at hand. Some common applications are: data compression, faster searches, data visualization, and finding groups and outliers. These steps will become clearer in the examples below.

Implementing PCA with UTL_NLA
The PCA functionality is implemented in the package DMT_PROJECTION by the following procedures and functions (see code at the end of the post):
  • pca
  • pca_eval
  • mk_xfm_query
  • mk_attr_map
  • mk_pca_query
The pca procedure has two signatures. The first works with arrays and is the core procedure. The second signature wraps the first one and works with tables. The wrapper procedure uses a set of utilities to convert from table to array and vice-versa (see Technical Note 3). The wrapper procedure also estimates the number of PCs to keep and produces a query string that can be used to score new data. Technical Note 2 below describes the wrapper procedure in more detail. All the examples below use the wrapper procedure.

The pca_eval procedure estimates the ideal number of PCs to keep (see discussion in Example 3 below). It also computes the information for a Scree plot and a Log-Likelihood plot. The latter is used to estimate the number of PCs based on the method described in [ZHU06].

The function mk_xfm_query generates a query string that transforms a source data set to make it more suitable for PCA. It supports three approaches: centralize only (subtract the mean), scale only (divide by the standard deviation), and z-score (centralize and scale). The scripts for the examples below show how this function is used.
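To make the idea concrete, the z-score version of such a transformation corresponds to SQL along these lines (a sketch only, written against the planet table used in Example 1; it is not the actual string generated by mk_xfm_query):
SELECT name,
       (distance - AVG(distance) OVER ()) / STDDEV(distance) OVER () distance,
       (diameter - AVG(diameter) OVER ()) / STDDEV(diameter) OVER () diameter,
       (density  - AVG(density)  OVER ()) / STDDEV(density)  OVER () density
FROM   planet;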

The mk_attr_map procedure creates 2D attribute maps for a set of attributes. The names of the columns with the two coordinates for the map are provided in the argument list. The result table contains the attribute maps for all the attributes listed in the argument list. The first coordinate labels the rows and the second coordinate labels the columns of this table. Example 2 showcases the use of this procedure for data visualization.

The function mk_pca_query generates a query string that can be used to compute the PC representation of data in a table, view, or query. To use this query it is only necessary to append a table name or subquery to the end of the returned string. This procedure is used by the pca wrapper procedure.

Example 1: Planets - Visualizing Major Features of the Data
The data for this example come from François Labelle's website on dimensionality reduction. The data contain the distance of each planet to the Sun, the planet's equatorial diameter, and the planet's density. The goal in this example is to find interesting features in the data set by visualizing it in 2D using PCA.

The following script transforms the data and finds the principal components using the pca wrapper procedure (see comments in the DMT_PROJECTION package code for a description of the arguments):
DECLARE
  v_source      LONG;
  v_score_query LONG;
  v_npc         NUMBER;
BEGIN
  v_source :=
    'SELECT name, log(10,distance) distance, ' ||
    'log(10,diameter) diameter, '||
    'log(10,density) density FROM planet';
  v_source :=
    dmt_projection.mk_xfm_query(v_source,
                                dmt_matrix.array('NAME'), 'C');

  -- Compute PCA
  dmt_projection.pca(
    p_source       => v_source,
    p_iname        => 'NAME',
    p_s_tab_name   => 'planet_s',
    p_v_tab_name   => 'planet_v',
    p_u_tab_name   => 'planet_u',
    p_scr_tab_name => 'planet_scr',
    p_score_query  => v_score_query,
    p_num_pc       => v_npc
  );
END;
/
Because the data span a couple of orders of magnitude, I applied a LOG transformation before centering the data using the mk_xfm_query function.

Table 1: planet_scr table.

Figure 4: Scree plot for the planetary data.

The planet_scr table (see Table 1) contains data for the Scree and Log-Likelihood plots. The Scree plot (Figure 4) charts all the eigenvalues in their decreasing order (from lower to higher PCs). I was curious so I checked out why it is called a Scree plot. The word scree comes from the Old Norse term for landslide. The term is used in this context because the plot resembles the side of a mountain with debris from the mountain lying at its base. The Log-Likelihood plot is covered in Example 3 below.

Table 1 shows that the first principal component (PC1) alone accounts for about 69% of the explained variance (PCT_VAR column). Adding PC2 brings it up to 98.8% (CUM_PCT_VAR column). As a result we can describe the data quite well by only considering the first two PCs.

Figure 5: Projection of planet data onto the first two PCs.

Figure 5 shows the distribution of the planets based on their PC1 (horizontal) and PC2 (vertical) coordinates. These numbers (scores) are persisted in the planet_u table. The red arrows show the projection of the original attributes in the PC coordinate system. The arrows point in the direction of increasing values of the original attribute. The direction of the arrows can be obtained from the V matrix returned by the pca procedure in the planet_v table. Each row in this table gives the PC coordinates for one of the original attributes.

The picture reveals some interesting features about the data. Even someone not trained in Astronomy would easily notice that Pluto is quite different from the other planets. This should come as no surprise given the recent debate about Pluto being a planet or not. It is also quite evident that there are two groups of planets. The first group contains smaller dense planets. The second group contains larger less dense planets. The second group is also farther away from the sun than the first group.

Example 2: Countries - Visualizing Attribute Information with Attribute Maps
The previous example highlighted how principal component projections can be used to uncover outliers and groups in data sets. However, it also showed that it is not very easy to relate these projections to the original attributes (the red arrows help but can also be confusing). An alternative approach is to create attribute maps in the lower-dimensional projection space (2D or 3D). For a given data set, attribute maps can be compared to each other and allow one to visually relate the distributions of different attributes. Again, it is easier to explain the idea with an example.

The data for this example also come from François Labelle's website on dimensionality reduction. The data contain the population, average income, and area for twenty countries.

Figure 6: Country data projection onto the first two PCs.

Figure 6 shows the projection of the LOG of the data onto the first two PCs (PC1 is the horizontal axis). From the picture we see that countries with large populations are in the upper left corner. Countries with large areas and large average incomes are closer to the lower right corner. It is harder to get insight into the countries in the middle. Also, this type of analysis becomes more and more difficult as the number of attributes increases (more red arrows in the picture).

We can construct attribute maps with the following steps:
  • Start with the PC1 and PC2 coordinates used to create a map like the one in Figure 5
  • Bin (discretize) the PC1 and PC2 values into a small number of buckets.
  • Aggregate (e.g., AVG, MAX, MIN, COUNT) the attribute of interest over all combinations of binned PC1 and PC2 values, even those not represented in the data so that we have a full matrix for plotting. The attribute being mapped can be one of those used in the PCA step (population, average income, area) or any other numeric attribute
  • Organize the results in a tabular format suitable for plotting using a surface chart: rows index one of the binned PCs and columns the other
The DMT_PROJECTION.MK_ATTR_MAP procedure implements all these steps. This procedure relies on the DMT_DATA_MINING_TRANSFORM package for binning the data, so you need to have the Oracle Data Mining option installed to use it. You can always comment out this procedure or replace the binning with your own if you do not have the Data Mining option.
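To make the steps above concrete, here is a rough sketch (not the actual MK_ATTR_MAP implementation) of the binning and aggregation for a population map on a 4x4 grid of PC1/PC2 buckets; the final pivot into a row/column layout and the filling of empty grid cells are omitted:
-- Bin the PC1/PC2 scores into 4 equi-width buckets and average the attribute
-- per grid cell. A tiny epsilon is added to the maximum so the largest value
-- falls in bucket 4 rather than WIDTH_BUCKET's overflow bucket.
SELECT pc2_bin, pc1_bin, AVG(population) population
FROM  (SELECT WIDTH_BUCKET(u.pc1, m.min1, m.max1 + 1e-9, 4) pc1_bin,
              WIDTH_BUCKET(u.pc2, m.min2, m.max2 + 1e-9, 4) pc2_bin,
              LOG(10, c.population) population
       FROM   country c, country_u u,
              (SELECT MIN(pc1) min1, MAX(pc1) max1,
                      MIN(pc2) min2, MAX(pc2) max2
               FROM   country_u) m
       WHERE  c.name = u.name)
GROUP BY pc2_bin, pc1_bin
ORDER BY pc2_bin, pc1_bin;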

Using the DMT_PROJECTION.MK_ATTR_MAP procedure, I computed attribute maps for Population, Average Income, and Area. Here is the code:
DECLARE
  v_source      LONG;
  v_score_query LONG;
  v_npc         NUMBER;
BEGIN
  v_source :=
    'SELECT name, log(10,population) log_pop, ' ||
    'log(10,avgincome) log_avgincome, log(10,area) ' ||
    'log_area FROM country';
  v_source := dmt_projection.mk_xfm_query(v_source,
                                          dmt_matrix.array('NAME'), 'Z');

  -- Compute PCA
  dmt_projection.pca(
    p_source       => v_source,
    p_iname        => 'NAME',
    p_s_tab_name   => 'country_s',
    p_v_tab_name   => 'country_v',
    p_u_tab_name   => 'country_u',
    p_scr_tab_name => 'country_scr',
    p_score_query  => v_score_query,
    p_num_pc       => v_npc
  );
  DBMS_OUTPUT.PUT_LINE('Example 2 - NumPC= ' || v_npc);

  -- Define query to create maps
  v_source := 'SELECT a.name, log(10,population) population, '||
              'log(10,avgincome) avgincome, '||
              'log(10,area) area, '||
              'b.pc1, b.pc2 ' ||
              'FROM country a, country_u b ' ||
              'WHERE a.name=b.name';

  -- Create maps
  dmt_projection.mk_attr_map(
    p_source   => v_source,
    p_nbins    => 4,
    p_iname    => 'NAME',
    p_coord    => dmt_matrix.array('PC2','PC1'),
    p_attr     => dmt_matrix.array('POPULATION',
                                   'AVGINCOME','AREA'),
    p_op       => 'AVG',
    p_bin_tab  => 'COUNTRY_BTAB',
    p_bin_view => 'COUNTRY_BVIEW',
    p_res_tab  => 'COUNTRY_ATTR_MAP');

END;
/
The code has two parts. First it does PCA on the data. Next it creates the maps. The maps are persisted in the COUNTRY_ATTR_MAP table. The bin coordinates for each country are returned in the COUNTRY_BVIEW view. The query used for creating the maps joins the original data with the PCA projections from the country_u table. To create a map for attributes not in the original data we just need to edit this query and replace the original table in the join with the new table containing the data we want to plot.

For this example I set the number of bins to 4 because the data set is small. The more rows in the data, the more bins we can have; this usually creates smoother maps. The procedure supports either equi-width or quantile binning. If we use quantile binning, the maps display an amplification effect similar to a Self-Organizing Map (SOM). Because we create more bins in areas with a higher concentration of cases, the map "zooms in" on these areas and spreads clumped regions out. This improves the visualization of dense regions at the cost that the map no longer shows distances accurately. A map created with quantile binning is not as useful for visual clustering as one created with equi-width binning or using the original PC projections as in Figure 6.

Figure 7: Population attribute map.

Figure 8: Average income attribute map.

Figure 9: Area attribute map.

Figures 7-9 display the three maps using MS Excel's surface chart. The maps place the countries on a regular grid (PC1 vs. PC2, where PC1 is the horizontal axis) by binning the values of the coordinates PC1 and PC2 into four equi-width buckets. The colors provide a high-to-low landscape with peaks (purple) and valleys (blue). Each point in the grid can contain more than one country. I labeled a few countries per point for reference purposes. The value of the attribute at a grid node represents the average (AVG) value for the countries at that grid node. Values (and colors) between grid points are created automatically by interpolation by the surface chart routine in Excel.

The maps provide a visual summary of the distribution of each attribute. For example, Figure 7 shows that China and India have very large populations. As we move away from China (the population peak) population drops. Figure 8 shows high average income countries in the middle of the map and cutting across to the upper right corner.

It is also very easy to relate the attribute values in one map to those in the other maps. For example, the maps show that, in general, countries with larger average income (purple in Figure 8) have smaller populations (yellow in Figure 7), the USA being an exception. Also, large countries (purple and red in Figure 9) have large populations (purple and red in Figure 7); Australia is the exception in this case.

Those interested in creating plots like Figures 7-9 with Excel's surface chart should take a look at Jon Peltier's excellent tutorial: Surface and Contour Charts in Microsoft Excel. The output produced by DMT_PROJECTION.MK_ATTR_MAP is already in a format that can be used for plotting with Excel or most surface charting tools. In my case I just selected from the p_res_tab table in SQL Developer and pasted the selection directly into Excel.

Example 3: Electric Utilities - Scree Plots and Number of PCs
An important question in PCA is how many PCs to retain. If we project a data set with PCA for visualization, we would like to know that the few PCs we are visualizing account for a significant share of the variation in the data. For compression applications, we would like to select the minimum number of PCs that achieves good compression. Again, this will become clearer with an example.

The data set for this example can be found in [JOH02] and Richard Johnson's web site (here). The data set gives corporate data on 22 US public utilities. There are eight measurements on each utility as follows:
  • X1: Fixed-charge covering ratio (income/debt)
  • X2: Rate of return on capital
  • X3: Cost per KW capacity in place
  • X4: Annual Load Factor
  • X5: Peak KWH demand growth from 1974 to 1975
  • X6: Sales (KWH use per year)
  • X7: Percent Nuclear
  • X8: Total fuel costs (cents per KWH)
The task for this example is to form groups (clusters) of similar utilities. An application, suggested in this lecture at MIT, where clustering would be useful is a study to predict the cost impact of deregulation. To do the required analysis economists would need to build a detailed cost model for each utility. It would save a significant amount of time and effort if we could:
  • Cluster similar utility firms
  • Build detailed cost models for just a "typical" firm in each cluster
  • Adjust the results from these models to estimate results for all the utility firms
Instead of applying a clustering algorithm directly to the data let's analyze the data visually and try to detect groups of similar utilities. First we use PCA to reduce the number of dimensions. Here is the code:
DECLARE
  v_source      LONG;
  v_score_query LONG;
  v_npc         NUMBER;
BEGIN
  v_source :=
    'SELECT name, x1, x2, x3, x4, x5, x6, x7, x8 FROM utility';
  v_source := dmt_projection.mk_xfm_query(v_source,
                                          dmt_matrix.array('NAME'),
                                          'Z');

  -- Compute PCA
  dmt_projection.pca(
    p_source       => v_source,
    p_iname        => 'NAME',
    p_s_tab_name   => 'utility_s',
    p_v_tab_name   => 'utility_v',
    p_u_tab_name   => 'utility_u',
    p_scr_tab_name => 'utility_scr',
    p_score_query  => v_score_query,
    p_num_pc       => v_npc
  );
  DBMS_OUTPUT.PUT_LINE('Example 3 - NumPC= ' || v_npc);
END;
/
Next we need to determine if a small number of PCs (2 or 3) captures enough of the characteristics of the data that we can feel confident that our visual approach has merit. There are many heuristics that can be used to estimate the number of PCs to keep (see [ZHU06] for a review). A common approach is to find a cut-off point (a PC number) in the Scree plot where the change from PC to PC in the plot levels off. That is, the mountain ends and the debris start. Until not long ago, there were no principled low computational cost approaches to select the cut-off point on the Scree plot. A recent paper by Zhu and Ghodsi [ZHU06] changed that. The paper introduces the profile log-likelihood, a metric that can be used to estimate the number of PCs to keep. Basically one should keep all the PCs up to and including the one for which the Profile Log-Likelihood plot peaks.
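As I understand the method in [ZHU06], for each candidate cut-off q the ordered eigenvalues d_1 >= ... >= d_p are split into the groups {d_1, ..., d_q} and {d_(q+1), ..., d_p}, each group is modeled as a Gaussian with its own mean but a common (pooled) variance, and the profile log-likelihood of the split is

l(q) = sum(i=1..q) log f(d_i; m_1, s^2) + sum(i=q+1..p) log f(d_i; m_2, s^2)

where f is the normal density, m_1 and m_2 are the two group means, and s^2 is the pooled variance estimate. The estimated number of PCs is the q that maximizes l(q). This is my paraphrase of the method; see [ZHU06] for the precise formulation.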

The pca wrapper procedure implements the approach described in [ZHU06]. The estimated number of PCs is returned in the p_num_pc argument. The procedure also returns the information required to create a Scree plot and the Profile Log-Likelihood plot in the table specified by the p_scr_tab_name argument.

Figure 10: Scree plot for electric utility data.

Figure 11: Log-likelihood profile plot for electric utility data.

Figures 10 and 11 show, respectively, the Scree and the Log-Likelihood plots using the data persisted in the utility_scr table. The Profile Log-Likelihood plot indicates that we should keep the first three PCs (the plot peaks at PC3). This might not have been immediately obvious from the Scree plot itself. The cumulative percentage explained variance for the first three PCs is 67.5%. PC3 accounts for about 16.5% of the variance. By keeping only the first three PCs we capture a good deal of the variance in the data.

Figure 12: Electric utility data projection onto PC1 and PC2 axes.

Let's start our analysis by looking at the projection of the data onto the first two PCs. This is represented in Figure 12. From the picture we can see that the values of PC1 (horizontal axis) divide the data into three sections: a left one with the Green group (San Diego), a right one with the Blue group, and a middle section with the remainder of the utilities. Along the PC2 axis the middle section separates into three subsections, colored red, yellow, and black in the picture. Finally, the Black group seems to have two subgroups separated by a line in the picture. This visual partition of the data is somewhat arbitrary, but it can be quite effective.

What happens if we add the information from PC3? Do we still have the same partitions? Can it help to further separate the yellow group?

Figure 13: Electric utility data projected onto the first three PCs (chromaticity diagram).

Figure 13 illustrates the projection of the data using the first three PCs (see Technical Note 4 for details on how to construct this plot). Each vertex of the triangle represents one of the first three PCs, the closer to the vertex the higher the value for the associated PC. The numbers have been scaled so that at a vertex the associated PC has a value of 1 and the other two PCs are zero. Also points at the line opposite to a vertex have a value of zero for the PC associated with the vertex. For example, San Diego has a value of almost 0 for PC1.

Most of the structures uncovered in Figure 12 are also present in Figure 13. The Blue and Red groups are preserved. San Diego (Green group) is still very different from the rest. One of the Black subgroups (Consolid, NewEngla, Hawaiian) was also kept intact. However, the other Black subgroup was merged with the Yellow group. We can also see that adding PC3 helped split the Yellow group into two subgroups. The partition of the Yellow utility firms takes place along a line that connects PC3 with the opposite side of the triangle.

Recapping, the task was to identify groups of similar utility firms. Using PCA we were able to visually identify 6 groups (Figure 13) with ease. Trying to uncover patterns by inspecting the original data directly would have been nearly impossible. The alternative would have been to use a clustering algorithm. The latter is a more sophisticated analysis requiring greater analytical skills from the user.

In a follow-up post I will show how to implement data interpolation using cubic splines. The implementation does not use the procedures in UTL_NLA, only the array types. However, it is another application of linear algebra and it could be coded against the UTL_NLA procedures.

Technical Note 1: The Math
Given the original data matrix X, PCA finds a rotation matrix P such that the new data Y=XP have a diagonal covariance matrix. The P matrix contains the coefficients used to create the linear combination of the original attributes that make up the new ones. Furthermore, P is a matrix with the eigenvectors of the covariance matrix of X in each column and ordered by descending eigenvalue size. We can achieve compression by truncating P. That is, we remove the columns of P that have the smallest eigenvalues. As a result Y is not only a rotated version of X but it is also smaller.

Let X be an m by n matrix where the mean of each column is zero. The covariance matrix of X is given by the n by n matrix C=X'X. The Singular Value Decomposition (SVD) of X is

X = U S V'

The matrix U is m by n and has orthogonal columns, that is, U'U = I, where I is an n by n identity matrix. The matrix S is an n by n diagonal matrix. The matrix V is an n by n matrix with orthogonal columns, that is, V'V = I. Using this decomposition we can rewrite C as

C = X'X = (U S V')' (U S V') = V S U'U S V' = V S^2 V'

From this expression we see that V is a matrix of eigenvectors of C and S^2 is the matrix of eigenvalues of C; the columns of both matrices are ordered by decreasing size of the associated eigenvalue. For example, let V_i be the ith column of V; then

    C Vi = Si^2 Vi

where Si^2 is the ith element of the diagonal of S^2. It is also the eigenvalue associated with Vi.

Setting P = V solves the PCA problem, that is, V is a rotation matrix with all the desired properties. To see this, write the covariance of Y as

    Y'Y = (X P)' (X P) = P' X'X P = P' C P

If we replace P with V we make Y'Y a diagonal matrix:

    Y'Y = V' C V = V' V S^2 V' V = S^2

So V is a rotation matrix that turns the covariance matrix of Y into a diagonal matrix.

The scores Y in the new coordinate system can also be obtained from the results of the SVD. Y can be computed by multiplying each column of U by the respective diagonal element of S as follows

    Y = X V = U S V' V = U S

This is what is done in the step that scales U in the pca core procedure (see Technical Note 2 below).

Technical Note 2: The Magic in the Box - Inside the Core and the Wrapper PCA Procedures
The actual code for computing PCA (the core pca procedure) is very small. It is a good example of how the UTL_NLA package makes it possible to implement complex calculations in a few lines of code. The code is as follows:
  PROCEDURE pca ( p_a    IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
                  p_m    IN            PLS_INTEGER,
                  p_n    IN            PLS_INTEGER,
                  p_pack IN            VARCHAR2 DEFAULT 'R',
                  p_s    IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
                  p_u    IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
                  p_vt   IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
                  p_jobu IN            BOOLEAN DEFAULT true
                )
  AS
    v_indx  PLS_INTEGER;
    v_alpha NUMBER;
    v_info  PLS_INTEGER;
    v_dim   PLS_INTEGER := p_n;
    v_ldvt  PLS_INTEGER := p_n;
    v_jobu  VARCHAR2(1) := 'N';
  BEGIN
    IF p_jobu THEN
      v_jobu := 'S';
    END IF;

    -- Get the correct dimensions to use
    IF (p_m <= p_n) THEN
      v_dim  := p_m - 1;
      v_ldvt := p_m;
    END IF;

    -- Compute SVD of matrix A: A = U * S * Vt
    UTL_NLA.LAPACK_GESVD(jobu  => v_jobu,
                         jobvt => 'S',
                         m     => p_m,
                         n     => p_n,
                         a     => p_a,
                         lda   => p_n,
                         s     => p_s,
                         u     => p_u,
                         ldu   => v_dim,
                         vt    => p_vt,
                         ldvt  => v_ldvt,
                         info  => v_info,
                         pack  => p_pack);

    -- Scale U to obtain projections
    IF p_jobu THEN
      v_indx := 0;
      FOR i IN 1..p_m LOOP
        FOR j IN 1..v_dim LOOP
          v_indx := v_indx + 1;
          p_u(v_indx) := p_u(v_indx) * p_s(j);
        END LOOP;
      END LOOP;
    END IF;

    -- Scale and square S
    v_alpha := 1 / SQRT(p_m - 1);
    FOR i IN 1..v_dim LOOP
      p_s(i) := v_alpha * p_s(i);
      p_s(i) := p_s(i) * p_s(i);
    END LOOP;
  END pca;
The above code uses Singular Value Decomposition (SVD) to compute PCA. It has three parts. First, it computes the SVD with a single call to a UTL_NLA procedure (UTL_NLA.LAPACK_GESVD). Then it scales the U matrix to obtain the projections. Finally, it scales and squares the S matrix to get the variances.
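To make the interface concrete, here is a small usage sketch: a 4 by 2, row-packed, mean-centered matrix run through the core procedure. I am assuming the procedure is compiled into the dmt_projection package mentioned in the Code section below; adjust the name to wherever you created it.

  DECLARE
    -- 4 x 2 matrix, row ('R') packed, columns already mean centered:
    --   ( 2  1)
    --   ( 1 -1)
    --   (-1  1)
    --   (-2 -1)
    v_a  utl_nla_array_dbl := utl_nla_array_dbl( 2,  1,
                                                 1, -1,
                                                -1,  1,
                                                -2, -1);
    v_s  utl_nla_array_dbl := utl_nla_array_dbl();
    v_u  utl_nla_array_dbl := utl_nla_array_dbl();
    v_vt utl_nla_array_dbl := utl_nla_array_dbl();
  BEGIN
    -- Assumed package name (see the Code section below)
    dmt_projection.pca(p_a    => v_a,
                       p_m    => 4,
                       p_n    => 2,
                       p_pack => 'R',
                       p_s    => v_s,
                       p_u    => v_u,
                       p_vt   => v_vt,
                       p_jobu => true);

    -- After the call v_s holds the variances (eigenvalues) of the PCs
    -- and v_u holds the projections (scores), row packed
    FOR i IN 1..v_s.COUNT LOOP
      dbms_output.put_line('Eigenvalue ' || i || ': ' || v_s(i));
    END LOOP;
  END;
  /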

  The pca wrapper procedure is a good example of how easy it is to write powerful code using the UTL_NLA package and the read and write procedures in DMT_MATRIX (see Technical Note 3 below). The full pca wrapper procedure is listed below:
  PROCEDURE pca ( p_source       IN  VARCHAR2,
                  p_iname        IN  VARCHAR2 DEFAULT NULL,
                  p_s_tab_name   IN  VARCHAR2,
                  p_v_tab_name   IN  VARCHAR2,
                  p_u_tab_name   IN  VARCHAR2,
                  p_scr_tab_name IN  VARCHAR2,
                  p_score_query  OUT NOCOPY VARCHAR2,
                  p_num_pc       OUT NOCOPY PLS_INTEGER
                )
  AS
    v_query     LONG;
    v_m         PLS_INTEGER;
    v_n         PLS_INTEGER;
    v_dim       PLS_INTEGER;
    v_a         UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
    v_s         UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
    v_u         UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
    v_vt        UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
    v_sc        UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
    v_col_names dmt_matrix.array;
    v_row_ids   dmt_matrix.array;
  BEGIN
    -- Read table into array
    dmt_matrix.table2array(
      p_source    => p_source,
      p_a         => v_a,
      p_m         => v_m,
      p_n         => v_n,
      p_col_names => v_col_names,
      p_row_ids   => v_row_ids,
      p_iname     => p_iname,
      p_force     => true);

    -- Compute PCA. It only returns information for
    -- PCs with non-zero eigenvalues
    pca (p_a    => v_a,
         p_m    => v_m,
         p_n    => v_n,
         p_pack => 'R',
         p_s    => v_s,
         p_u    => v_u,
         p_vt   => v_vt,
         p_jobu => false);

    -- Get the correct number of components returned.
    -- This is the min(v_n, v_m-1)
    IF (v_m <= v_n) THEN
      v_dim := v_m - 1;
    ELSE
      v_dim := v_n;
    END IF;

    -- Make scoring query
    p_score_query := mk_pca_query(v_col_names, p_iname, v_dim, v_vt);

    -- Estimate number of PCs and scree plot information
    pca_eval(v_s, v_sc, p_num_pc, v_dim);

    -- Persist eigenvalues
    -- Number of cols is v_dim
    dmt_matrix.array2table(
      p_a        => v_s,
      p_m        => 1,
      p_n        => v_dim,
      p_tab_name => p_s_tab_name,
      p_col_pfx  => 'PC',
      p_iname    => NULL);

    -- Persist PCs (V) (using pack = C transposes the matrix
    -- as it was originally R packed)
    -- Number of cols in V is v_dim
    dmt_matrix.array2table(
      p_a        => v_vt,
      p_m        => v_n,
      p_n        => v_dim,
      p_tab_name => p_v_tab_name,
      p_col_pfx  => 'PC',
      p_iname    => 'ATTRIBUTE',
      p_row_ids  => v_col_names,
      p_pack     => 'C');

    -- Persist scores
    -- Number of cols in U is v_dim
    -- Using the scoring query so that we project all the data
    IF p_u_tab_name IS NOT NULL THEN
      v_query := 'CREATE TABLE ' || p_u_tab_name ||
                 ' AS ' || p_score_query || '( ' ||
                 p_source || ')';
      EXECUTE IMMEDIATE v_query;
    END IF;

    -- Persist scree and log-likelihood plot information
    dmt_matrix.array2table(
      p_a         => v_sc,
      p_m         => v_dim,
      p_n         => 4,
      p_iname     => NULL,
      p_col_names => array('ID','PCT_VAR','CUM_PCT_VAR',
                           'LOG_LIKELIHOOD'),
      p_tab_name  => p_scr_tab_name);
  END pca;
The procedure has the following main steps:
  • Read the data from a table into an array
  • Perform PCA on the array
  • Generate the scoring query
  • Estimate the number of PCs and compute the scree and log-likelihood plots
  • Persist the PCA results (eigenvalues, PCs, scores, and scree and log-likelihood plot information)
With the exception of the last one, each of these steps is implemented with a single procedure call. The last step requires a procedure call for each persisted result, for a total of three calls to the array2table procedure. In the code I use the scoring query, instead of the U matrix, to persist the scores. This way I am sure to get projections for all the rows in the original table, even when the table does not fit in memory.
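As a usage sketch, a call to the wrapper might look like the block below. The package name dmt_projection and the table names are placeholders; p_iname is passed the name of the row identifier column (my reading of the parameter, based on how it is used above).

  DECLARE
    v_score_query VARCHAR2(32767);
    v_num_pc      PLS_INTEGER;
  BEGIN
    dmt_projection.pca(
      p_source       => 'UTILITIES_DATA',  -- input table (placeholder)
      p_iname        => 'UTILITY_ID',      -- row id column (placeholder)
      p_s_tab_name   => 'UTIL_PCA_S',      -- eigenvalues
      p_v_tab_name   => 'UTIL_PCA_V',      -- principal components
      p_u_tab_name   => 'UTIL_PCA_U',      -- projections (scores)
      p_scr_tab_name => 'UTIL_PCA_SCREE',  -- scree / log-likelihood data
      p_score_query  => v_score_query,
      p_num_pc       => v_num_pc);

    dbms_output.put_line('Suggested number of PCs: ' || v_num_pc);
  END;
  /

Note that the returned p_score_query is not a complete statement on its own: as the CREATE TABLE step above shows, the wrapper appends '( ' || p_source || ')' to it, so to score another table with the same layout you would concatenate the query with that table's name in the same way.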

Technical Note 3: Read and Write Procedures
The DMT_MATRIX package contains the following procedures for reading and writing data between tables and arrays:
  • table2array - Reads data from a table and stores them in a matrix represented by an array of type UTL_NLA_ARRAY_DBL.
  • table2array2 - Reads data from three columns in a table (or source) and stores them in a matrix represented by an array of type UTL_NLA_ARRAY_DBL. The first column contains the row index, the second column the column index, and the third the value. The row and column indexes do not have to be numeric. However, there should be only one instance of a given row-column index pair. This procedure is useful for reading data generated by the array2table2 procedure (see below). It is also useful for reading data with more than one thousand attributes or data that is naturally stored in transactional format.
  • array2table - Writes data from a matrix into a table. The table's column names and the matrix row IDs are either specified in the argument list or automatically generated.
  • array2table2 - Writes data from a matrix into a table with 3 or 4 columns. The first (optional) column contains an ID for the matrix. This allows storing many matrices in a single table. The ID for a matrix should be unique for each matrix stored in a table. The second column contains the row index. The third contains the column index. The fourth contains the values.
These procedures greatly simplify coding linear algebra operations on data stored in tables. The wrapper version of the PCA procedure uses these procedures. The code for the DMT_MATRIX package provides more information on the procedures, and the dmt_matrix_sample2.sql file has a few examples of their use.
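As a quick illustration of the read/write pair, here is a sketch of a round trip: read a table into an array and write the array back out under a new name. The table and column names are placeholders, and the parameter names follow the calls made in the wrapper procedure above; I am assuming they can be combined this way.

  DECLARE
    v_a         utl_nla_array_dbl := utl_nla_array_dbl();
    v_m         PLS_INTEGER;
    v_n         PLS_INTEGER;
    v_col_names dmt_matrix.array;
    v_row_ids   dmt_matrix.array;
  BEGIN
    -- Table -> array (UTILITIES_DATA and UTILITY_ID are placeholders)
    dmt_matrix.table2array(
      p_source    => 'UTILITIES_DATA',
      p_a         => v_a,
      p_m         => v_m,
      p_n         => v_n,
      p_col_names => v_col_names,
      p_row_ids   => v_row_ids,
      p_iname     => 'UTILITY_ID',
      p_force     => true);

    -- Array -> table, reusing the original column names and row ids
    dmt_matrix.array2table(
      p_a         => v_a,
      p_m         => v_m,
      p_n         => v_n,
      p_tab_name  => 'UTILITIES_COPY',
      p_col_names => v_col_names,
      p_iname     => 'UTILITY_ID',
      p_row_ids   => v_row_ids);
  END;
  /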

Technical Note 4: Chromaticity Diagram
The approach used for Figure 13 is based on the chromaticity diagrams used for color spaces. These diagrams compress three-dimensional data into a two-dimensional plot. The caveat is that points in the original three-dimensional space that are scaled versions of each other map to the same point in the diagram. For example, the points (0.1, 0.2, 0.2) and (0.2, 0.4, 0.4) both map to the point (0.78, 0.53) in the diagram. These points represent the same overall pattern; the difference between them is a matter of "strength or scale." In the case of colors the difference between the two points is one of luminance: the second point is a brighter version of the first. In the case of utilities (see Example 3 above), the second utility has the same operational characteristics as the first but operates at a larger scale.
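To see why scaled points collapse onto the same spot, divide each point by the sum of its coordinates (the chromaticity-style projection onto the plane PC1 + PC2 + PC3 = 1):

  \frac{(0.1,\,0.2,\,0.2)}{0.1 + 0.2 + 0.2} = (0.2,\,0.4,\,0.4),
  \qquad
  \frac{(0.2,\,0.4,\,0.4)}{0.2 + 0.4 + 0.4} = (0.2,\,0.4,\,0.4).

Both points land on the same location in the triangle; the 2D coordinates quoted above then depend only on how the triangle is laid out in the chart.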

The graphic in Figure 13 was created with Excel's 2D scatter plot. The data for it was generated as follows:
  • Normalize each PC projection by subtracting its minimum and dividing the result by the difference between its maximum and its minimum. This makes the projected points fall inside a cube with sides of length one.
  • Compute the pi projections (i = 1, 2, 3) of the points in the cube onto the plane PC1 + PC2 + PC3 = 1, where (PC1, PC2, PC3) are the normalized PC projections from the previous step: pi = PCi / (PC1 + PC2 + PC3).
  • Compute the 2D scatter plot coordinates (T1, T2), where T1 and T2 are, respectively, the horizontal and the vertical distances from the triangle's PC1 corner. A sketch of one way to compute these coordinates is given below.
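The constants in the last step depend on where the triangle's corners are placed in the Excel chart, so the query below is a sketch under one common ternary-plot convention (PC1 corner at the origin, PC2 corner at (1, 0), PC3 corner at (0.5, sqrt(3)/2)); it is not necessarily the exact mapping used for Figure 13. UTIL_PCA_U, UTILITY_ID, and PC1..PC3 are placeholder names for a table holding the first three projections.

  -- Map three PC projections to 2D ternary coordinates
  WITH norm AS (
    -- Step 1: min-max normalize each PC to [0, 1]
    SELECT utility_id,
           (pc1 - MIN(pc1) OVER ()) / (MAX(pc1) OVER () - MIN(pc1) OVER ()) AS n1,
           (pc2 - MIN(pc2) OVER ()) / (MAX(pc2) OVER () - MIN(pc2) OVER ()) AS n2,
           (pc3 - MIN(pc3) OVER ()) / (MAX(pc3) OVER () - MIN(pc3) OVER ()) AS n3
      FROM util_pca_u
  ),
  simplex AS (
    -- Step 2: project onto the plane p1 + p2 + p3 = 1
    SELECT utility_id,
           n1 / (n1 + n2 + n3) AS p1,
           n2 / (n1 + n2 + n3) AS p2,
           n3 / (n1 + n2 + n3) AS p3
      FROM norm
  )
  -- Step 3: place the point inside the triangle
  SELECT utility_id,
         p2 + 0.5 * p3    AS t1,  -- horizontal distance from the PC1 corner
         SQRT(3) / 2 * p3 AS t2   -- vertical distance from the PC1 corner
    FROM simplex;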
Code
The widget below points to the source code for the dmt_projection and dmt_matrix packages as well as scripts for the examples in the post. There are also links to a dump file containing the data for the examples and a script to create a schema and load the data.

References
[JOH02] Johnson, Richard Arnold, and Wichern, Dean W. (2002). Applied Multivariate Statistical Analysis, Prentice Hall, 5th edition.

[ZHU06] Zhu M., and Ghodsi A. (2006), "Automatic dimensionality selection from the scree plot via the use of profile likelihood." Comput Stat Data Anal, 51(2), 918 - 930.

Labels: , ,

About me

  • Marcos M. Campos: Development Manager for Oracle Data Mining Technologies. Previously Senior Scientist with Thinking Machines. Over the years I have been working on transforming databases into easy to use analytical servers.
  • My profile

Disclaimer

  • Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
  • This work is licensed under a Creative Commons license.
