
Monday, September 08, 2008 

Collective Intelligence 1: Building an RSS Feed Archive

For a long time I have thought that we needed data mining books written for developers. Most data mining books are written for business or data analysts. Given that, it was a pleasant surprise to read Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. The book provides a good discussion of data mining concepts, anchored with interesting examples, and offers guidance on when to use the techniques. I still think there is room for improvement in how data mining and analytics are presented to developers. However, the book is a great step forward in enabling developers to use analytics.

Toby's book has many Web 2.0 examples illustrated with data obtained from different web sites (e.g., blogs, Facebook, Zebo). The examples are coded in Python or rely on freely available software for the data gathering and analytics. This got me thinking: if one wanted to illustrate how to create scalable and robust applications exploiting collective intelligence, would it not be interesting to replicate some of those examples using the analytical and XML processing capabilities in the Oracle database? So I decided to do exactly that. This is the first post in a series showing how to work through problems discussed in the book using technology in the Oracle database. Although most of what is described in the book can be done with the Oracle database, this series will only showcase some of the examples; implementing all of them would be like writing a new book.

Before we can start mining we need to collect data and store it in a database. There are many Web 2.0 websites with interesting data for mining. XML is the de facto data format returned by these websites. This post covers the basic steps for getting this data by showing how to build an archive of entries from a list of RSS feeds. In later posts I will describe the mining.

First the good news: the Oracle RDBMS has a great set of features supporting XML and HTTP processing. Now the bad news: before writing this post I knew nothing about those features. So, after some digging around the web, I found lots of good information in a couple of articles. I have to say that I am really impressed by what can be done with the functionality in the database.

Below I describe the preparatory steps needed, how to query multiple RSS feeds at once, and finally how to automate the whole process to build an RSS feed archive.

Preparatory Steps
Lucas Jellema has a nice series of articles describing how to build an RSS feed reader and archive (1, 2). I used them as a starting point. The comments in the first article pointed me in the right direction to move from a PL/SQL implementation, as described in the article, to the SQL version given below.

Here is a list of things that one needs to do before running the code in Lucas' articles or the code below:

  • Configure fine-grained access using ACLs (access control lists) - required for Oracle 11g
  • Set up proxy handling - only needed if the database sits behind a proxy server
  • Configure the database character set
Configuring Fine-Grained Access
Oracle provides access to external network services through several PL/SQL APIs (UTL_TCP, UTL_SMTP, UTL_MAIL, UTL_HTTP and UTL_INADDR). Before Oracle 11g, access to these external services was based on whether a user had been granted execute permission on a specific package. Oracle 11g introduced fine-grained access to network services using access control lists (ACLs). ACLs make it possible to control which users can access which network resources, regardless of package grants.

Basically, to take advantage of the features in these packages, we need to:
  • Create an access control list (ACL)
  • Assign the ACL to the network hosts we want to reach
CONN sys/manager AS SYSDBA
GRANT EXECUTE ON UTL_HTTP TO dmuser;
BEGIN
  DBMS_NETWORK_ACL_ADMIN.create_acl (
    acl          => 'mining_the_web.xml', 
    description  => 'ACL functionality for mining the web',
    principal    => 'DMUSER',
    is_grant     => TRUE, 
    privilege    => 'connect',
    start_date   => SYSTIMESTAMP,
    end_date     => NULL);

  COMMIT;
END;
/
BEGIN
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl         => 'mining_the_web.xml',
    host        => '*', 
    lower_port  => 80,
    upper_port  => NULL); 

  COMMIT;
END;
/
The assign_acl call above is very generous. It allows the DMUSER user to connect to any server (*) through port 80. For more details take a look at this nice article by Tim Hall.
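If you would rather not open port 80 to every host, the ACL can be assigned to individual hosts instead. A minimal sketch, assuming the same mining_the_web.xml ACL, restricting access to one feed host (repeat the call for each host listed in rss_feeds):
BEGIN
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl         => 'mining_the_web.xml',
    host        => 'rss.cnn.com',   -- one of the feed hosts used below
    lower_port  => 80,
    upper_port  => 80);

  COMMIT;
END;
/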

Handling Proxies
If your database sits behind a firewall, you need to account for the proxy server by executing the following command in your session:
EXEC utl_http.set_proxy('proxy.my-company.com', 'corp.my-company.com');
For more details see the documentation for the UTL_HTTP package (link).

Configuring Database Character Set
When retrieving XML from RSS feeds it is a good idea to use a database with a Unicode character set, for example AL32UTF8. This prevents getting an error when trying to persist the XML content to a table.
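To verify which character set your database uses, you can query NLS_DATABASE_PARAMETERS:
SELECT value
FROM nls_database_parameters
WHERE parameter = 'NLS_CHARACTERSET';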

Querying Multiple RSS Feeds
First we create a table rss_feeds to hold the RSS feeds we want to track.
DROP TABLE rss_feeds PURGE;
CREATE TABLE rss_feeds
(title    varchar2(200), 
 url      varchar2(500), 
 category varchar2(50));
Next we insert a number of RSS feeds into the table, assigning a category (news, technology, etc.) to each feed.
INSERT INTO rss_feeds VALUES ('iTunes Top 25 Songs', 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wpa/MRSS/topsongs/limit=25/rss.xml','music');
INSERT INTO rss_feeds VALUES ('Apple Hot News', 'http://images.apple.com/main/rss/hotnews/hotnews.rss','Apple');
INSERT INTO rss_feeds VALUES ('Oracle Press Releases', 'http://www.oracle.com/rss/rss_ocom_pr.xml','Oracle');
INSERT INTO rss_feeds VALUES ('OTN Articles', 'http://www.oracle.com/technology/syndication/rss_otn_articles.xml','Oracle');
INSERT INTO rss_feeds VALUES ('CNN', 'http://rss.cnn.com/rss/cnn_topstories.rss','news');
INSERT INTO rss_feeds VALUES ('BBC', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml','news');
INSERT INTO rss_feeds VALUES ('NYT', 'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml','news');
INSERT INTO rss_feeds VALUES ('WPost Technology', 'http://www.washingtonpost.com/wp-dyn/rss/technology/index.xml','technology');
INSERT INTO rss_feeds VALUES ('Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot','technology');
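Remember to commit the inserts so that other sessions, including the scheduler job created later, can see the feed list:
COMMIT;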
Now we can use the SQL statement below to query the RSS feeds and map the XML documents retrieved to a relational schema:
SELECT t.source, p.scategory, it.title, it.link, it.pubdate, 
       it.author,
       XMLCAST(
         XMLQUERY('let $b := count(//category)
                   for $i at $pos in //category
                     return ($i, if ($pos = $b) then () else ", ")'
                  PASSING it.category RETURNING CONTENT
                 ) AS VARCHAR2(4000)
               ) category,
       it.description
FROM (SELECT HTTPURITYPE(url).getXML() doc, category scategory 
      FROM rss_feeds) p,
     XMLTABLE('/rss/channel' PASSING p.doc
              COLUMNS source VARCHAR2(4000) PATH 'title',
                      item XMLTYPE          PATH 'item') t,
     XMLTABLE('item' PASSING t.item
              COLUMNS title       VARCHAR2(4000) PATH 'title',
                      link        VARCHAR2(4000) PATH 'link',
                      pubDate     VARCHAR2(4000) PATH 'pubDate',
                      author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                      category    XMLTYPE        PATH 'category',
                      description CLOB           PATH 'description') it;
The above query has a number of interesting features. It uses HTTPURITYPE(url).getXML() to retrieve the XML document for each RSS feed in the rss_feeds table. The XMLTABLE clauses then process the XML document returned by each feed, extract the desired pieces of information, and convert the document into a relational source. The first XMLTABLE extracts the title and the item elements from each XML document returned by p. Because each XML document contains multiple items, a second XMLTABLE is used to break out the pieces of each item element extracted by the first one. This second XMLTABLE also returns the category as XMLTYPE because some sites have multiple category elements for a given item. The category elements are then concatenated by the XMLQUERY function into a single string separated by commas. For alternative ways of doing this last bit, take a look at this post by Mark Volkmann.
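When a single feed misbehaves, it can help to test it in isolation before debugging the full query, for example against one of the feeds inserted above:
SELECT HTTPURITYPE('http://rss.cnn.com/rss/cnn_topstories.rss').getXML() doc
FROM dual;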

One problem with the multi-feed query is that, if one of the RSS feed sites is down or unreachable, the whole query fails. Given the nature of the web, this is not an unlikely event when collecting feeds from many sites. To overcome this problem, we can implement a PL/SQL function that wraps HTTPURITYPE and validates the URL before invoking it. If the site for a URL does not respond, the function returns NULL. The code in the sections below follows this approach.

Building the RSS Feed Archive
To build the RSS feed archive we want to periodically run the above query and persist to a table only the entries that have not been stored already.

First let's create a table for holding the RSS feed entries:
DROP TABLE rss_history PURGE;
CREATE TABLE rss_history
(source      varchar2(4000),
 scategory   varchar2(200),
 title       varchar2(4000),
 link        varchar2(4000),
 pubDate     varchar2(200),
 author      varchar2(4000),
 category    varchar2(4000),
 description CLOB);
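Since the MERGE statement below matches new entries against existing ones on the link column, it may be worth adding an index once the archive grows. A simple sketch (the index name is arbitrary):
CREATE INDEX rss_history_link_idx ON rss_history (link);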
Next we define a function that validates an RSS feed's URL and returns the XML for the feed.
CREATE OR REPLACE FUNCTION rss_getXML(url IN varchar2) 
                           RETURN XMLType
IS
  req   UTL_HTTP.REQ;
  resp  UTL_HTTP.RESP;
  res   XMLType := null;
BEGIN
  -- Probe the URL first; if the site returns an error status (4xx or 5xx),
  -- skip it and return NULL instead of failing.
  req  := UTL_HTTP.BEGIN_REQUEST(url);
  resp := UTL_HTTP.GET_RESPONSE(req);
  IF (resp.status_code >= 400 AND
      resp.status_code <= 599)
  THEN
    res := null;
  ELSE
    -- The site responded; fetch and parse the feed.
    res := HTTPURITYPE(url).getXML();
  END IF;
  UTL_HTTP.END_RESPONSE(resp);
  RETURN res;

  EXCEPTION
    -- Any failure (unreachable host, timeout, bad XML) also returns NULL
    -- so that a single bad feed does not break the whole query.
    WHEN OTHERS THEN
      BEGIN
        UTL_HTTP.END_RESPONSE(resp);
      EXCEPTION
        WHEN OTHERS THEN NULL;  -- the response may never have been opened
      END;
      RETURN res;
END;
/
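As a quick sanity check, assuming the ACL, grants, and proxy settings described earlier are in place, the function can be tried against a single feed. An unreachable URL should simply return NULL instead of raising an error:
SELECT rss_getXML('http://rss.slashdot.org/Slashdot/slashdot') doc FROM dual;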
We can now use the following MERGE statement to merge only newer entries into the archive:
MERGE 
INTO rss_history rh 
USING (SELECT t.source, p.scategory, it.title, it.link, it.pubdate, 
              it.author, 
              XMLCAST(
                XMLQUERY('let $b := count(//category)
                          for $i at $pos in //category 
                            return ($i, if ($pos = $b) then () else ", ")'
                         PASSING it.category RETURNING CONTENT
                        ) AS VARCHAR2(4000) 
                     ) category,
              it.description
       FROM (SELECT rss_getXML(url) doc, category scategory 
             FROM rss_feeds) p,
            XMLTABLE('/rss/channel' PASSING p.doc
                     COLUMNS source VARCHAR2(4000) PATH 'title',
                             item XMLTYPE PATH 'item') t,
            XMLTABLE('item' PASSING t.item
                     COLUMNS title       VARCHAR2(4000) PATH 'title',
                             link        VARCHAR2(4000) PATH 'link',
                             pubDate     VARCHAR2(4000) PATH 'pubDate',
                             author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                             category    XMLTYPE        PATH 'category',
                             description CLOB           PATH 'description') it
      ) rl
ON (rh.link = rl.link)
WHEN NOT MATCHED 
THEN INSERT (title, source, scategory, link, pubDate, 
             author, category, description)
     VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate, 
             rl.author, rl.category, rl.description);
It is very impressive that a single SQL statement can retrieve entries from multiple RSS feeds, parse the XML, and merge the newer entries into a table for archival. 

Finally, let's set up a DBMS_SCHEDULER job to run this MERGE statement every fifteen minutes. First we create a PL/SQL procedure wrapping the above MERGE statement:
CREATE OR REPLACE PROCEDURE update_rss_history
 (p_proxy            IN VARCHAR2 DEFAULT NULL,
  p_no_proxy_domains IN VARCHAR2 DEFAULT NULL) 
IS
BEGIN
  -- When going through a firewall, pass requests through p_proxy host.
  -- Specify sites (p_no_proxy_domains) inside the firewall that don't 
  -- need the proxy host.
  IF (p_proxy is not null) AND
     (p_no_proxy_domains is not null)
  THEN
    UTL_HTTP.SET_PROXY(p_proxy, p_no_proxy_domains);
  END IF;  
  
  MERGE 
  INTO rss_history rh 
  USING ( SELECT t.source, p.scategory, it.title, it.link, it.pubdate, 
                 it.author, 
                 XMLCAST(
                   XMLQUERY('let $b := count(//category)
                             for $i at $pos in //category 
                               return ($i, if ($pos = $b) then () 
                                           else ", ")'
                            PASSING it.category RETURNING CONTENT
                           ) AS VARCHAR2(4000) 
                        ) category,
                 it.description
          FROM (SELECT rss_getXML(url) doc, 
                       category scategory 
                FROM rss_feeds) p,
               XMLTABLE('/rss/channel' PASSING p.doc
                        COLUMNS source VARCHAR2(4000) PATH 'title',
                                item XMLTYPE PATH 'item') t,
               XMLTABLE('item' PASSING t.item
                        COLUMNS title       VARCHAR2(4000) PATH 'title',
                                link        VARCHAR2(4000) PATH 'link',
                                pubDate     VARCHAR2(4000) PATH 'pubDate',
                                author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                                category    XMLTYPE        PATH 'category',
                                description CLOB           PATH 'description') it
        ) rl
  ON (rh.link = rl.link)
  WHEN NOT MATCHED 
  THEN INSERT (title, source, scategory, link, pubDate, 
               author, category, description)
       VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate, 
               rl.author, rl.category, rl.description);  
END;     
/
Besides the MERGE statement, the above procedure also includes some code for handling proxy servers as explained above.
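To test the procedure manually from behind a firewall, it can be invoked with the proxy arguments set explicitly (the host names below are the same placeholders used in the proxy section above):
BEGIN
  update_rss_history(p_proxy            => 'proxy.my-company.com',
                     p_no_proxy_domains => 'corp.my-company.com');
  COMMIT;
END;
/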

Next we create the job that will execute this procedure every fifteen minutes:
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
   job_name        =>  'UPDATE_RSS_ARCHIVE',
   job_type        =>  'PLSQL_BLOCK',
   job_action      =>  'BEGIN 
                           update_rss_history;
                          COMMIT;
                        END;',
   enabled         =>  TRUE,
   start_date      =>  SYSDATE,
   repeat_interval => 'FREQ=MINUTELY;INTERVAL=15');
END;
/
For more details on DBMS_SCHEDULER see the Oracle documentation (link). 
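To test the job without waiting for its first scheduled run, it can be executed immediately:
EXECUTE DBMS_SCHEDULER.run_job(job_name => 'UPDATE_RSS_ARCHIVE');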

To check on the status of the job we can use the following query:
SELECT job_name, enabled, 
       TO_CHAR(last_start_date,'YY-MM-DD HH24:MI:SS') last_run,
       TO_CHAR(last_run_duration,'HH:MI:SS') last_duration
FROM user_scheduler_jobs;
To manually stop and/or drop the job we can use the following calls:
-- Stop job
EXECUTE DBMS_SCHEDULER.stop_job (job_name => 'update_rss_archive');

-- Drop job
EXECUTE DBMS_SCHEDULER.drop_job (job_name => 'update_rss_archive');
After a few days of running, the archive has accumulated quite a few entries:
SELECT COUNT(*) FROM rss_history;

  COUNT(*)
----------
      1558
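To see how the entries break down by feed and category, a simple aggregate query works, for example:
SELECT source, scategory, COUNT(*) entries
FROM rss_history
GROUP BY source, scategory
ORDER BY entries DESC;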
In later posts I will show how to mine the archive.


