Collective Intelligence 1: Building an RSS Feed Archive
For a long time I have thought that we needed data mining books written for developers. Most data mining books are written for business or data analysts. Given that, it was a pleasant surprise to read Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. The book provides a good discussion on data mining concepts anchored with interesting examples. It also provides guidance on when to use the techniques. I still think that there is room for improvement on how data mining and analytics should be presented to developers. However, the book is a great step forward in enabling developers to use analytics.
Toby's book has many Web 2.0 examples illustrated with data obtained from different web sites (e.g., blogs, Facebook, Zebo). The examples are coded in Python or rely on freely available software for the data gathering and analytics. This got me thinking: if one wanted to illustrate how to create scalable and robust applications exploiting collective intelligence, would it not be interesting to replicate some of those examples using the analytical and XML processing capabilities of the Oracle database? So I decided to do exactly that. This is the first post in a series showing how the problems discussed in the book can be tackled with technology in the Oracle database. Although most of what is described in the book can be done with the Oracle database, this series will only showcase some of the examples; implementing all of them would be like writing a new book.
Before we can start mining we need to collect data and store it in a database. There are many Web 2.0 websites with interesting data for mining, and XML is the de facto data format returned by these websites. This post covers the basic steps for getting this data by showing how to build an archive of entries from a list of RSS feeds. In later posts I will describe the mining.
First the good news: the Oracle RDBMS has a great set of features supporting XML and HTTP processing. Now the bad news: before writing this post I knew nothing about those features. So, after some digging around the web, I found lots of good information in a couple of articles. I have to say that I am really impressed by what can be done with the functionality in the database.
Below I describe some preparatory steps, how to query multiple RSS feeds at once, and finally how to automate the whole process to build an RSS feed archive.
Preparatory Steps
Lucas Jellema has a nice series of articles describing how to build an RSS feed reader and archive (1, 2). I used them as a starting point. The comments in the first article pointed me in the right direction to move from a PL/SQL implementation, as described in the article, to the SQL version given below.
Here is a list of things that one needs to do before running the code in Lucas' articles or the code below:
- Configure fine-grained access using ACLs (access control lists) - needed in 11g
- Take care of proxy servers - only needed if your machine sits behind a proxy server
- Configure the database character set
Oracle provides access to external network services through several PL/SQL APIs (UTL_TCP, UTL_SMTP, UTL_MAIL, UTL_HTTP and UTL_INADDR). Before Oracle 11g, access to these external services was controlled solely by whether a user had execute permission on a specific package. Oracle 11g introduced fine-grained access to network services using access control lists (ACLs). ACLs allow you to control which users can access which network resources, regardless of package grants.
Basically, we need to do the following to take advantage of the features in these packages:
- Create an access control list (ACL)
- Assign the ACL to the network
CONN sys/manager AS SYSDBA

GRANT EXECUTE ON UTL_HTTP TO dmuser;

BEGIN
  DBMS_NETWORK_ACL_ADMIN.create_acl (
    acl         => 'mining_the_web.xml',
    description => 'ACL functionality for mining the web',
    principal   => 'DMUSER',
    is_grant    => TRUE,
    privilege   => 'connect',
    start_date  => SYSTIMESTAMP,
    end_date    => NULL);
  COMMIT;
END;
/
BEGIN
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl        => 'mining_the_web.xml',
    host       => '*',
    lower_port => 80,
    upper_port => NULL);
  COMMIT;
END;
/

The assign_acl call above is very generous: it allows the DMUSER user to connect to any server (*) through port 80. For more details take a look at this nice article by Tim Hall.
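To double-check the setup, the ACL and its grants can be inspected through the data dictionary. This is a minimal sketch, not part of the original articles, assuming the 11g views DBA_NETWORK_ACLS and DBA_NETWORK_ACL_PRIVILEGES are accessible to you:

-- Which hosts and ports the ACL covers
SELECT host, lower_port, upper_port, acl
  FROM dba_network_acls
 WHERE acl LIKE '%mining_the_web.xml';

-- Which principals and privileges are granted in the ACL
SELECT acl, principal, privilege, is_grant
  FROM dba_network_acl_privileges
 WHERE acl LIKE '%mining_the_web.xml';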
Handling Proxies
If your machine sits behind a firewall, you need to account for proxies by executing the following command in your session:
EXEC utl_http.set_proxy('proxy.my-company.com', 'corp.my-company.com');

For more details see the documentation for the UTL_HTTP package.
Configuring Database Character Set
When retrieving XML from RSS feeds it is a good idea to use a database with a Unicode character set, for example AL32UTF8. This avoids character set conversion errors when trying to persist the XML content to a table.
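To find out which character set your database currently uses, a quick query against the NLS settings is enough; a small sketch:

-- Show the database character set (ideally AL32UTF8 for arbitrary feed content)
SELECT value
  FROM nls_database_parameters
 WHERE parameter = 'NLS_CHARACTERSET';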
Querying Multiple RSS Feeds
First we create a table rss_feeds to hold the RSS feeds we want to track.
DROP TABLE rss_feeds PURGE;

CREATE TABLE rss_feeds (
  title    VARCHAR2(200),
  url      VARCHAR2(500),
  category VARCHAR2(50));

Next we insert a couple of RSS feeds into the table, assigning a category to each feed, such as news, technology, etc.
INSERT INTO rss_feeds VALUES ('iTunes Top 25 Songs', 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wpa/MRSS/topsongs/limit=25/rss.xml', 'music');
INSERT INTO rss_feeds VALUES ('Apple Hot News', 'http://images.apple.com/main/rss/hotnews/hotnews.rss', 'Apple');
INSERT INTO rss_feeds VALUES ('Oracle Press Releases', 'http://www.oracle.com/rss/rss_ocom_pr.xml', 'Oracle');
INSERT INTO rss_feeds VALUES ('OTN Articles', 'http://www.oracle.com/technology/syndication/rss_otn_articles.xml', 'Oracle');
INSERT INTO rss_feeds VALUES ('CNN', 'http://rss.cnn.com/rss/cnn_topstories.rss', 'news');
INSERT INTO rss_feeds VALUES ('BBC', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml', 'news');
INSERT INTO rss_feeds VALUES ('NYT', 'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'news');
INSERT INTO rss_feeds VALUES ('WPost Technology', 'http://www.washingtonpost.com/wp-dyn/rss/technology/index.xml', 'technology');
INSERT INTO rss_feeds VALUES ('Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot', 'technology');

Now we can use the SQL statement below to query the RSS feeds and map the XML documents retrieved to a relational schema:
SELECT t.source, p.scategory, it.title, it.link, it.pubdate, it.author,
       XMLCAST(
         XMLQUERY('let $b := count(//category)
                   for $i at $pos in //category
                   return ($i, if ($pos = $b) then () else ", ")'
                  PASSING it.category RETURNING CONTENT)
         AS VARCHAR2(4000)) category,
       it.description
  FROM (SELECT HTTPURITYPE(url).getXML() doc, category scategory
          FROM rss_feeds) p,
       XMLTABLE('/rss/channel' PASSING p.doc
                COLUMNS source VARCHAR2(4000) PATH 'title',
                        item   XMLTYPE        PATH 'item') t,
       XMLTABLE('item' PASSING t.item
                COLUMNS title       VARCHAR2(4000) PATH 'title',
                        link        VARCHAR2(4000) PATH 'link',
                        pubDate     VARCHAR2(4000) PATH 'pubDate',
                        author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                        category    XMLTYPE        PATH 'category',
                        description CLOB           PATH 'description') it;

The above query has a number of interesting features. It uses the HTTPURITYPE function to retrieve the XML for each RSS feed in the rss_feeds table. The XMLTABLE clauses then process the XML document returned by each feed to extract the desired pieces of information and turn the document into a relational source. The first XMLTABLE extracts the title and the item elements from each XML document returned by the inline view p. Because each XML document contains multiple items, a second XMLTABLE is used to return the pieces of each item element extracted by the first XMLTABLE. This second XMLTABLE also returns the category as XMLTYPE, which is needed because some sites have multiple category elements for a given item. The category elements are then extracted with the XMLQUERY function, which concatenates them into a single comma-separated string. For alternative ways of doing the last bit take a look at this post by Mark Volkmann.
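One such alternative, sketched below and not part of the original query, is to let XQuery's string-join function do the concatenation instead of the explicit for loop. The example uses a hand-built XML literal so it can be tested standalone; treat it as an untested variation:

-- Standalone test of the string-join variant on a hand-built fragment
SELECT XMLCAST(
         XMLQUERY('string-join(//category, ", ")'
                  PASSING XMLTYPE('<item><category>news</category><category>technology</category></item>')
                  RETURNING CONTENT)
         AS VARCHAR2(4000)) categories
  FROM dual;
-- Should return: news, technology

The same XMLQUERY expression can replace the let/for version in the query above by passing it.category instead of the literal fragment.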
One problem with the above query is that if one of the RSS feed sites is down or unreachable, the whole query fails. Given the nature of the web, this is not an unlikely event if we are trying to collect feeds from many sites. To overcome this problem, we can implement a PL/SQL function that wraps HTTPURITYPE and validates the URL before invoking it. If the site for a URL does not respond, the function returns NULL. The code in the sections below follows this approach.
Building the RSS Feed Archive
To build the RSS feed archive we want to periodically run the above query and persist to a table only the entries that have not been stored already.
First let's create a table for holding the RSS feed entries:
DROP TABLE rss_history PURGE;

CREATE TABLE rss_history (
  source      VARCHAR2(4000),
  scategory   VARCHAR2(200),
  title       VARCHAR2(4000),
  link        VARCHAR2(4000),
  pubDate     VARCHAR2(200),
  author      VARCHAR2(4000),
  category    VARCHAR2(4000),
  description CLOB);

Next we define a function to validate an RSS feed's URL and return the XML for the feed.
CREATE OR REPLACE FUNCTION rss_getXML(url IN VARCHAR2) RETURN XMLType IS
  req  UTL_HTTP.REQ;
  resp UTL_HTTP.RESP;
  res  XMLType := NULL;
BEGIN
  -- Probe the URL first; only fetch the XML if the site responds without an error status
  req  := UTL_HTTP.BEGIN_REQUEST(url);
  resp := UTL_HTTP.GET_RESPONSE(req);
  IF (resp.status_code >= 400 AND resp.status_code <= 599) THEN
    res := NULL;
  ELSE
    res := HTTPURITYPE(url).getXML();
  END IF;
  UTL_HTTP.END_RESPONSE(resp);
  RETURN res;
EXCEPTION
  WHEN OTHERS THEN
    -- If the site is down or unreachable, return NULL so the calling query keeps going.
    -- The nested block guards against END_RESPONSE failing when no response was opened.
    BEGIN
      UTL_HTTP.END_RESPONSE(resp);
    EXCEPTION
      WHEN OTHERS THEN NULL;
    END;
    RETURN res;
END;
/

We can now use the following MERGE statement to merge only newer entries into the archive:
MERGE INTO rss_history rh
USING (SELECT t.source, p.scategory, it.title, it.link, it.pubdate, it.author,
              XMLCAST(
                XMLQUERY('let $b := count(//category)
                          for $i at $pos in //category
                          return ($i, if ($pos = $b) then () else ", ")'
                         PASSING it.category RETURNING CONTENT)
                AS VARCHAR2(4000)) category,
              it.description
         FROM (SELECT rss_getXML(url) doc, category scategory
                 FROM rss_feeds) p,
              XMLTABLE('/rss/channel' PASSING p.doc
                       COLUMNS source VARCHAR2(4000) PATH 'title',
                               item   XMLTYPE        PATH 'item') t,
              XMLTABLE('item' PASSING t.item
                       COLUMNS title       VARCHAR2(4000) PATH 'title',
                               link        VARCHAR2(4000) PATH 'link',
                               pubDate     VARCHAR2(4000) PATH 'pubDate',
                               author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                               category    XMLTYPE        PATH 'category',
                               description CLOB           PATH 'description') it
      ) rl
ON (rh.link = rl.link)
WHEN NOT MATCHED THEN
  INSERT (title, source, scategory, link, pubDate, author, category, description)
  VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate, rl.author, rl.category, rl.description);

It is very impressive that a single SQL statement can retrieve entries from multiple RSS feeds, parse the XML, and merge the newer entries into a table for archival.
Finally, let's set up a DBMS_SCHEDULER job to run this MERGE statement every fifteen minutes. First we create a PL/SQL procedure wrapping the above MERGE statement:
CREATE OR REPLACE PROCEDURE update_rss_history (
  p_proxy            IN VARCHAR2 DEFAULT NULL,
  p_no_proxy_domains IN VARCHAR2 DEFAULT NULL) IS
BEGIN
  -- When going through a firewall, pass requests through the p_proxy host.
  -- Optionally specify sites (p_no_proxy_domains) inside the firewall that
  -- don't need the proxy host.
  IF p_proxy IS NOT NULL THEN
    UTL_HTTP.SET_PROXY(p_proxy, p_no_proxy_domains);
  END IF;

  MERGE INTO rss_history rh
  USING (SELECT t.source, p.scategory, it.title, it.link, it.pubdate, it.author,
                XMLCAST(
                  XMLQUERY('let $b := count(//category)
                            for $i at $pos in //category
                            return ($i, if ($pos = $b) then () else ", ")'
                           PASSING it.category RETURNING CONTENT)
                  AS VARCHAR2(4000)) category,
                it.description
           FROM (SELECT rss_getXML(url) doc, category scategory
                   FROM rss_feeds) p,
                XMLTABLE('/rss/channel' PASSING p.doc
                         COLUMNS source VARCHAR2(4000) PATH 'title',
                                 item   XMLTYPE        PATH 'item') t,
                XMLTABLE('item' PASSING t.item
                         COLUMNS title       VARCHAR2(4000) PATH 'title',
                                 link        VARCHAR2(4000) PATH 'link',
                                 pubDate     VARCHAR2(4000) PATH 'pubDate',
                                 author      VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
                                 category    XMLTYPE        PATH 'category',
                                 description CLOB           PATH 'description') it
        ) rl
  ON (rh.link = rl.link)
  WHEN NOT MATCHED THEN
    INSERT (title, source, scategory, link, pubDate, author, category, description)
    VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate, rl.author, rl.category, rl.description);
END;
/

Besides the MERGE statement, the above procedure also includes some code for handling proxy servers, as explained above. Both parameters default to NULL, so the procedure can be called without arguments when no proxy is needed.
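If your database sits behind a firewall, the procedure can also be invoked manually with your proxy settings. A minimal sketch, reusing the placeholder host names from the proxy section above:

-- Run the merge once, routing HTTP requests through the (placeholder) corporate proxy
BEGIN
  update_rss_history('proxy.my-company.com', 'corp.my-company.com');
  COMMIT;
END;
/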
Next we create the job that will execute this procedure every fifteen minutes:
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'UPDATE_RSS_ARCHIVE',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN update_rss_history(); COMMIT; END;',
    enabled         => TRUE,
    start_date      => SYSDATE,
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=15');
END;
/

For more details on DBMS_SCHEDULER see the Oracle documentation.
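To make sure everything is wired up correctly, the job can also be kicked off on demand. RUN_JOB is a standard DBMS_SCHEDULER call; running it in the current session makes any errors visible immediately:

-- Run the archive job once in the current session so errors surface right away
EXEC DBMS_SCHEDULER.run_job(job_name => 'UPDATE_RSS_ARCHIVE', use_current_session => TRUE);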
To check on the status of the job we can use the following query:
SELECT job_name, enabled,
       TO_CHAR(last_start_date, 'YY-MM-DD HH24:MI:SS') last_run,
       TO_CHAR(last_run_duration, 'HH:MI:SS') duration
  FROM user_scheduler_jobs;

To manually stop and/or drop the job we can use the following calls:
-- Stop job
EXECUTE DBMS_SCHEDULER.stop_job (job_name => 'update_rss_archive');

-- Drop job
EXECUTE DBMS_SCHEDULER.drop_job (job_name => 'update_rss_archive');

After running for a few days, the archive has accumulated quite a few entries:
SELECT COUNT(*) FROM rss_history;

  COUNT(*)
----------
      1558

In later posts I will show how to mine the archive.
Labels: Collective Intelligence, XML
Nice one Marcos. When do you think you'll be able to publish the 2nd part of this article? I know it's a silly question but can't wait for it!
Many thanks
Posted by Anonymous | 10/02/2008 08:23:00 AM
Thanks. I hope it won't take too long.
Posted by Marcos | 10/03/2008 09:48:00 PM
Thanks for this tip, Marcos. It had never occurred to me to attempt something like this, yet to a web-surfer like me, this looks like a killer app for the Oracle Database.
Posted by Anonymous | 10/29/2008 01:10:00 PM