Tuesday, January 31, 2006 

Blog Changes

I've added support for categories. Because Blogger does not support categories, this requires some hacking. Phydeaux3 has a great tutorial on how to add category support to Blogger using del.icio.us.

As of now, most blog posts come with a list of the categories they belong to. You can browse through the posts for a category using the pulldown menu under Posts on the sidebar, or by clicking on the category name on the Category Cloud, also on the sidebar.

Wednesday, January 25, 2006 

A Couple of Papers on Oracle Analytics Available on OTN

The following papers are available on OTN (link to site):

  • Adding Data Mining to Extend Your OLAP BI Solution (link)
  • Data-Centric Automated Data Mining (link)
  • Mining High-Dimensional Data for Information Fusion (link)
  • Support Vector Machines in Oracle Database 10g (link)
  • Data Mining-Based Intrusion Detection (link)
  • Oracle9i O-Cluster: Scalable Clustering of Large High Dimensional Data Sets (link)
  • Clustering Large Databases with Numeric and Nominal Values Using Orthogonal Projections (link)
Presentations on some of these papers can also be found on the above OTN site.

The first paper shows how data mining can be leveraged to select relevant dimensions for creating OLAP cubes.

The second paper proposes a new approach to the design of data mining applications targeted at database and business intelligence users. This approach uses a data-centric focus and automated methodologies to make data mining accessible to non-experts.

The third paper shows how the RDBMS provides an effective platform for building information fusion applications. It demonstrates the approach on satellite imagery using a combination of data mining and spatial processing components.

The fourth paper presents Oracle’s implementation of SVM where the primary focus lies on ease of use and scalability while maintaining high performance accuracy.

The fifth paper introduces DAID, a database-centric architecture that leverages data mining within the Oracle RDBMS to address the challenges that exist in the design and implementation of production quality intrusion detection systems. DAID also offers numerous advantages in terms of scheduling capabilities, alert infrastructure, data analysis tools, security, scalability, and reliability.

The last two papers describe O-Cluster, a clustering algorithm in Oracle Data Mining that scales to a large number of attributes and rows.

Readings: Business intelligence, Data mining, Oracle analytics

Monday, January 23, 2006 

Time Series Forecasting - Part 1

This is Part 1 in a series on time series forecasting - The full series is Part 1, Part 2, and Part 3.

Time series forecasting is supported in the Oracle Database by the Oracle OLAP FORECAST command and by Oracle Data Mining (ODM). The FORECAST command can forecast data using one of three methods: straight-line trend, exponential growth, or Holt-Winters extrapolation. FORECAST performs the calculation according to the selected method and optionally stores the result in a variable in your analytic workspace. The first two methods are simple extrapolation techniques. The Holt-Winters forecasting method is more sophisticated. It is a type of exponential smoothing or moving average technique. The Holt-Winters method constructs three statistically related series, which are used to make the actual forecast. These series are:

  1. The smoothed data series, which is the original data with seasonal effects and random error removed
  2. The seasonal index series, which is the seasonal effect for each period. A value greater than one represents a seasonal increase in the data for that period, and a value less than one is a seasonal decrease in the data
  3. The trend series, which is the change in the data for each period with the seasonal effects and random error removed
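To make the three series concrete, here is a minimal Python sketch of the textbook multiplicative Holt-Winters recursions. It is not the Oracle OLAP implementation: the function name, the initialization (level set to the mean of the first season, trend set to zero), the smoothing constants, and the sample data are all assumptions chosen for illustration.

```python
def holt_winters_multiplicative(y, m, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with multiplicative Holt-Winters.

    y       -- observations at equally spaced times
    m       -- number of periods in one season
    alpha, beta, gamma -- smoothing constants for level, trend, seasonal index
    """
    if len(y) < m:
        raise ValueError("need at least one full season of data")

    # Assumed initialization (one of several common choices).
    level = sum(y[:m]) / m                   # smoothed data series
    trend = 0.0                              # trend series
    seasonal = [v / level for v in y[:m]]    # seasonal index series

    for t in range(m, len(y)):
        s_prev = seasonal[t % m]             # index from one season ago
        prev_level = level
        # Smoothed series: remove the seasonal effect, then smooth.
        level = alpha * y[t] / s_prev + (1 - alpha) * (prev_level + trend)
        # Trend: smoothed change in the level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Seasonal index: > 1 is a seasonal increase, < 1 a decrease.
        seasonal[t % m] = gamma * y[t] / level + (1 - gamma) * s_prev

    n = len(y)
    return [(level + h * trend) * seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]


# Hypothetical series with a two-period season and no trend.
forecasts = holt_winters_multiplicative([20, 10] * 4, m=2,
                                        alpha=0.5, beta=0.3, gamma=0.2,
                                        horizon=2)
print(forecasts)
```

For this purely seasonal sample the forecasts simply continue the high/low pattern, because the level stays flat and the trend stays at zero.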
The methods supported by the FORECAST command are "univariate time series" methods. That is, they can only be used to model a time series that consists of single (scalar) observations recorded sequentially over equal time increments. The FORECAST command methods are also linear in nature and cannot capture complex relationships between inputs and outputs.

ODM, through its Support Vector Machine (SVM) regression functionality, provides a powerful non-linear technique for time series forecasting that can include other variables besides the series itself and can capture complex relationships. The rest of this post covers the data mining approach to time series modeling. In the next post I will give an example of time series forecasting using ODM and the approach described below.

Data Mining Approach

ODM SVM regression supports modeling of time series via a time delay or lag space approach. This approach is also called "state-space reconstruction" in the physics community and "tapped delay line" in the engineering community. In its simplest form, past values of the target (the time series we want to forecast) are used as inputs (predictors) to the model. These inputs are called lagged variables and can be easily computed using the SQL LAG analytic function. Other attributes that are also relevant for forecasting the series can be added in the same fashion. Suppose we are trying to forecast the maximum daily electrical load based on electrical load values and average daily temperatures. Following the above approach, for a given date, we could use, for example, the load values and the average daily temperatures for the previous two days as inputs. This is illustrated in the table below where the lagged values are computed using the SQL LAG analytic function. The data shows maximum load values (Y) and average daily temperatures (X) for 10 days.

DAY    Y  LAG(Y,1)  LAG(Y,2)     X  LAG(X,1)  LAG(X,2)
  1  797         .         .  -7.6         .         .
  2  777       797         .  -6.3      -7.6         .
  3  797       777       797  -3.0      -6.3      -7.6
  4  757       797       777   0.7      -3.0      -6.3
  5  707       757       797  -1.9       0.7      -3.0
  6  730       707       757  -6.0      -1.9       0.7
  7  818       730       707  -6.2      -6.0      -1.9
  8  818       818       730  -3.9      -6.2      -6.0
  9  803       818       818  -6.3      -3.9      -6.2
 10  804       803       818  -1.1      -6.3      -3.9
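The lagged columns in the table can be reproduced with a few lines of Python. This is only a sketch: the `lag` helper is a hypothetical stand-in for the SQL LAG analytic function, with None playing the role of the missing values shown as dots.

```python
def lag(values, k):
    """Shift a series down by k rows; the first k entries are None (like SQL LAG)."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


# Maximum daily load (Y) and average daily temperature (X) for the 10 days.
y = [797, 777, 797, 757, 707, 730, 818, 818, 803, 804]
x = [-7.6, -6.3, -3.0, 0.7, -1.9, -6.0, -6.2, -3.9, -6.3, -1.1]

# One row per day: DAY, Y, LAG(Y,1), LAG(Y,2), X, LAG(X,1), LAG(X,2)
rows = list(zip(range(1, 11), y, lag(y, 1), lag(y, 2), x, lag(x, 1), lag(x, 2)))

# Rows that still have a missing lag cannot be used as training cases.
complete = [r for r in rows if None not in r]

for r in complete:
    print(r)
```

With a window of two lags, the first two days are dropped, which leaves eight usable rows out of ten.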

In some cases, the values for auxiliary attributes (X), like the average temperature in the example above, would not be known at the time we are trying to forecast the target (Y) and would therefore not be included among the inputs. However, we would still be able to use the lagged values of X. Once the attributes have been selected, we can train the SVM model with the target (Y) and the predictor attributes. In the example above the predictors are LAG(Y,1), LAG(Y,2), X, LAG(X,1), and LAG(X,2). The data would be split into training and test data sets. Usually one would train on earlier dates and test on later ones. Alternatively, for one-step-ahead forecast testing (more on this below), the training and test data sets can be randomly selected from all available data. An SVM regression model whose only inputs are lagged target (Y) values is called an "autoregressive model." The input space that includes all of the lagged variables is called the "embedding space."

Things get a bit more complicated if the data rows in the time series are not equally spaced, that is, the time interval between observations is not the same. One approach is to use a smoothing technique to compute values for the attributes at equally spaced time intervals, and then use the interpolated values for training instead of the original data.
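As a rough illustration of putting irregular observations on an equally spaced grid, the sketch below uses plain linear interpolation, which is the simplest stand-in for a smoothing technique and not necessarily the best choice for noisy data. The function name and the sample observations are hypothetical.

```python
def resample_linear(times, values, step):
    """Interpolate (times, values) onto a grid that starts at times[0]
    and advances by `step`. `times` must be strictly increasing."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs")

    grid, out = [], []
    j = 0  # index of the segment [times[j], times[j + 1]] that contains t
    k = 0
    while times[0] + k * step <= times[-1]:
        t = times[0] + k * step
        while times[j + 1] < t:      # advance to the segment containing t
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)     # position of t inside the segment, 0..1
        grid.append(t)
        out.append(values[j] + w * (values[j + 1] - values[j]))
        k += 1
    return grid, out


# Hypothetical observations taken on days 0, 1, 3, 4 and 7.
times = [0, 1, 3, 4, 7]
values = [10, 12, 16, 15, 9]

grid, out = resample_linear(times, values, step=1)
print(list(zip(grid, out)))
```

The lagged variables would then be computed from the resampled values rather than from the original rows.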

Methodology

When modeling time series, following the above approach, it is necessary to make decisions regarding:
  1. Trend removal
  2. Target transformation
  3. Lagged attribute selection
These decisions are required for the majority of time series forecasting techniques.

Trend Removal

The time delay approach above is effective only if the time series is stationary. This implies that the statistical distribution of the time series values is the same at the various time intervals. In particular, it means that the time series does not have a trend. In practice, many time series exhibit a trend, that is, the series values tend to go up, or tend to go down, over time. For example, many financial indicators, such as stock prices, usually go up over time. A trend therefore has to be removed before modeling. The simplest method for doing so is called differencing, and it is the standard statistical method for handling nondeterministic (stochastic) trends. In this case, instead of using Y (the time series value) as the target, we use the difference D = Y-LAG(Y,1). The same applies to the lagged target values. For example, instead of using LAG(Y,1) as a predictor we would use LAG(Y,1)-LAG(Y,2). Sometimes it is necessary to compute differences of differences. At apply time the differencing of the target can be reversed to obtain forecasts for the original series.
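Differencing and its reversal are easy to sketch in Python. The helper names are hypothetical, the series is the load data from the earlier table, and the predicted differences at the end are made-up numbers standing in for model output.

```python
def difference(y):
    """First differences D_t = Y_t - Y_(t-1); one element shorter than y."""
    return [b - a for a, b in zip(y, y[1:])]


def undifference(last_value, diffs):
    """Rebuild levels from differences, starting from the last known level."""
    levels, current = [], last_value
    for d in diffs:
        current += d
        levels.append(current)
    return levels


y = [797, 777, 797, 757, 707, 730, 818, 818, 803, 804]

d = difference(y)        # target used instead of Y
dd = difference(d)       # differences of differences, if one pass is not enough

# Reversing the differencing at apply time: hypothetical predicted
# differences for Days 11 and 12 are added back onto the last actual value.
predicted_diffs = [5, -3]
forecast_levels = undifference(y[-1], predicted_diffs)
print(d)
print(forecast_levels)
```

Note that every pass of differencing costs one row at the start of the series, on top of the rows lost to the lag window.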

Target Transformation
It is usually useful to normalize the target for SVM regression. This helps speed up the algorithm convergence. For time series problems, the target should be normalized prior to the creation of the lagged variables.
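The ordering matters because normalizing first means the target and all of its lags share one set of normalization parameters. A minimal sketch follows, assuming z-score normalization (the post does not say which scheme to use) and hypothetical helper names.

```python
from statistics import mean, pstdev


def fit_zscore(values):
    """Return the (mean, standard deviation) used for normalization."""
    return mean(values), pstdev(values)


def apply_zscore(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def invert_zscore(values, mu, sigma):
    """Map normalized values (for example forecasts) back to original units."""
    return [v * sigma + mu for v in values]


y = [797, 777, 797, 757, 707, 730, 818, 818, 803, 804]

# 1. Normalize the target once.
mu, sigma = fit_zscore(y)
z = apply_zscore(y, mu, sigma)

# 2. Only then create the lagged variables from the normalized series.
z_lag1 = [None] + z[:-1]
z_lag2 = [None, None] + z[:-2]

# Forecasts made on the normalized scale are converted back with the same
# parameters.
restored = invert_zscore(z, mu, sigma)
```

In practice the parameters would be fitted on the training rows only and then reused unchanged for the test rows.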

Lagged Attribute Selection
You can select the lags either by analyzing the data (computing the correlogram and cross-correlations) or by choosing a window size. For example, if we use a window of size 2 we would include LAG(Y,1) and LAG(Y,2) as predictors, where Y is the target attribute. Some care is needed in choosing the window size. The window size directly affects the pattern recognition capability of the SVM algorithm. It limits the size of the patterns that can be recognized. If the window is too small we might not have enough information to capture the dynamics of the system underlying the time series data. Different patterns may look the same as only a small fraction of the pattern is revealed by the lagged attributes. If the window is too large, the extra lagged attributes will add noise and make the problem harder to solve.
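Both ways of choosing lags can be sketched briefly in Python. The first function computes the sample autocorrelation used in a correlogram, and the second builds training cases for a given window size. The function names and the toy series are hypothetical.

```python
def autocorrelation(y, k):
    """Sample autocorrelation of y at lag k (one point of the correlogram)."""
    n = len(y)
    mu = sum(y) / n
    denom = sum((v - mu) ** 2 for v in y)
    num = sum((y[t] - mu) * (y[t - k] - mu) for t in range(k, n))
    return num / denom


def make_windows(y, window):
    """Return (inputs, target) pairs where inputs = [LAG(Y,1), ..., LAG(Y,window)]."""
    cases = []
    for t in range(window, len(y)):
        lags = [y[t - k] for k in range(1, window + 1)]
        cases.append((lags, y[t]))
    return cases


# Toy series that repeats every 4 steps.
series = [1, 2, 3, 2] * 6

correlogram = [autocorrelation(series, k) for k in range(0, 5)]
print(correlogram)   # the strongest positive value after lag 0 is at lag 4

print(make_windows([1, 2, 3, 4], window=2))
```

For the toy series the correlogram peaks again at lag 4, which suggests that a window shorter than 4 would hide the repeating pattern from the model.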

Computing Forecasts

There are several different ways to compute forecasts. The two most commonly used strategies are: one-step-ahead (open-loop) forecasting and multi-step (closed-loop) forecasting.

Single-Step or Open-Loop Forecasting
This strategy requires all the input values to the model to be available. If the previous value of the target is included in the model we can only make forecasts for the next time interval, thus the single-step name. For the demand forecast example above, we would only be able to forecast the demand for one day in the future (Day 11). In order to compute Y_12, the forecast for Day 12, we would need to wait until the actual values for Day 11 were available. In other words, let's say that we have trained an SVM regression model with target Y and inputs LAG(Y,1) and LAG(Y,2). Let the output (predicted value) computed by the model prediction be designated by P(.,.). Then:
  • forecast Y_11 as P(Y_10,Y_9)
  • forecast Y_12 as P(Y_11,Y_10)
  • and so on
Single-step forecasts can be directly computed using Oracle Data Miner's Apply and Test mining activities or the SQL Prediction function. This is illustrated in Part 2 of this series.
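The single-step bookkeeping can be sketched in Python as follows. The `predict` function is a hypothetical stand-in for P(.,.); it simply averages its two inputs and is not an SVM model. The Day 11 actual value is also made up.

```python
def predict(lag1, lag2):
    """Hypothetical stand-in for the trained model's P(LAG(Y,1), LAG(Y,2))."""
    return 0.5 * lag1 + 0.5 * lag2


def forecast_next(history, model):
    """One-step-ahead forecast: uses only actual values from the history."""
    return model(history[-1], history[-2])


# Actual loads for Days 1..10.
history = [797, 777, 797, 757, 707, 730, 818, 818, 803, 804]

# Forecast Y_11 as P(Y_10, Y_9).
p11 = forecast_next(history, predict)

# Y_12 can only be forecast once the actual value for Day 11 arrives.
history.append(790)          # hypothetical actual value for Day 11
p12 = forecast_next(history, predict)   # P(Y_11, Y_10)

print(p11, p12)
```

The forecast itself is never fed back into the history, which is what makes the loop "open".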

Multi-step or Closed-Loop Forecasting
This strategy uses actual values for the inputs when available and estimated or predicted values when the actual values are not available. Let's say that we have trained an SVM regression model with target Y and inputs LAG(Y,1) and LAG(Y,2). Again, let the output (predicted value) computed by the model prediction be designated by P(.,.). Then:
  • forecast Y_11 as P_11 = P(Y_10,Y_9)
  • forecast Y_12 as P_12 = P(P_11,Y_10)
  • and so on
In this case we can generate a forecast for Y_12 even if we do not have the actual value for Y_11. The strategy is to use the predictions for previous time steps as estimates for the missing inputs.

Multi-step forecasts can be computed using a simple PL/SQL procedure. This is illustrated in Part 3 of this series.
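The closed-loop strategy can be sketched in a few lines of Python. This is not the PL/SQL procedure from Part 3. The `predict` function is again a hypothetical stand-in for P(.,.) that averages its two inputs, not an SVM model.

```python
def predict(lag1, lag2):
    """Hypothetical stand-in for the trained model's P(LAG(Y,1), LAG(Y,2))."""
    return 0.5 * lag1 + 0.5 * lag2


def multi_step_forecast(history, model, steps):
    """Closed-loop forecast: predictions are fed back as inputs
    whenever the actual values are not available."""
    window = list(history[-2:])      # the last two known actual values
    forecasts = []
    for _ in range(steps):
        p = model(window[-1], window[-2])
        forecasts.append(p)
        window.append(p)             # the prediction replaces the missing actual
    return forecasts


# Actual loads for Days 1..10.
history = [797, 777, 797, 757, 707, 730, 818, 818, 803, 804]

# P_11 = P(Y_10, Y_9), P_12 = P(P_11, Y_10), P_13 = P(P_12, P_11)
p11, p12, p13 = multi_step_forecast(history, predict, steps=3)
print(p11, p12, p13)
```

Because each prediction becomes an input to the next one, errors can accumulate as the forecast horizon grows.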

Comparison with Traditional Time Series Techniques

SVM regression offers the same benefits for time series forecasting as those of feedforward neural networks, but with simpler training. The advantages of using such models include:
  • The ability to model very complex functions
  • The ability to use a large number of variables in the model and to include other data (e.g., fundamental and technical factors) in addition to lagged time series data
The methodology described here creates non-linear autoregressive models. These models are very powerful and have been used in areas such as: financial forecast, electric load forecast, chaos modeling, and sunspot prediction. In contrast, ARIMA, a popular time series forecasting technique, supports models with both autoregressive and moving average components. However, ARIMA models are linear, while SVM regression models can capture non-linear relationships.


Readings: Business intelligence, Data mining, Oracle analytics

Monday, January 16, 2006 

Oracle Data Mining JDeveloper Extension on OTN

Just out: the Oracle Java Data Mining (OJDM) API, part of Oracle 10gR2 Data Mining, is now available as an extension for JDeveloper.

This is a brief description straight from the OTN site with instructions on how to get and install the JDeveloper extension:

This extension installs all the Java libraries necessary for developing advanced analytics applications using the JSR-73-compliant API in the JDeveloper environment. In addition to providing access to all the data mining functionality of ODM, OJDM makes developing advanced analytics applications easier by including support for both synchronous and asynchronous data mining tasks, such as model building, testing, and batch and real-time scoring. Models produced through the Java API are fully interoperable with the PL/SQL API as well as the Oracle Data Miner GUI.

Readings: Business intelligence, Data mining, Oracle analytics

Tuesday, January 10, 2006 

What Is Data Mining?

I have seen this question asked many times. This question has also created a popular thread in the Oracle Data Mining forum. In this post I'll discuss what data mining is and is not. I will also try to contrast data mining with other activities such as OLAP and Statistics.

Data mining has been a buzzword for some time now. The term has been used and misused in many different contexts. Some definitions of data mining include:

"Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics, machine learning and pattern recognition" [1].

"The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [2].

"The science of extracting useful information from large data sets or databases" [3].
The above definitions highlight some key elements associated with the data mining activity:
  • Automatic discovery of patterns
  • Discovery of patterns that are not easy to detect (non-trivial)
  • Creation of actionable information
  • Focus on large data sets and databases
In summary, data mining is about finding patterns in data that are not easily spotted. Query and reporting have often been incorrectly called data mining. True to its nature, however, data mining implies building models, hence the notion of automatic discovery. Data mining models create abstractions (actionable information) of the data that can be used to answer questions that one would not be able to ask the data directly using just a query.

Data Mining and Statistics
There is a great deal of overlap between data mining and statistics. In fact, most of the techniques used in data mining can be placed in a statistical framework. However, data mining techniques are not among the traditional statistical techniques. This can lead to the impression that data mining and statistics are competing disciplines. Traditional statistical methods, in general, require a great deal of user interaction in order to validate the correctness of the model, and thus they are harder to automate. These methods also do not usually scale well to very large data sets. On the other hand, data mining methods are suitable for large data sets and can be more readily automated. In fact, data mining algorithms, in many cases, require large data sets for the creation of quality models.

Data Mining and OLAP
On-Line Analytical Processing (OLAP) can be defined as fast analysis of shared multidimensional data [4]. OLAP and data mining are different but complementary activities. OLAP analysis may include time series analysis, cost allocations, currency translation, goal seeking, ad hoc multidimensional structural changes, non-procedural modeling, exception alerting, and data mining. However, most OLAP systems do not have inductive inference (or data mining) capabilities beyond the support for time-series forecast.

Multidimensional data is a key concept in OLAP. OLAP systems provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies. This view of the data is a natural way to analyze businesses and organizations. Data mining systems (and models), on the other hand, usually do not have a concept of dimensions and hierarchies.

Data mining and OLAP can be integrated in a number of ways. For example, data mining can be used to select the dimensions for a cube, create new attributes for a dimension, and create new measures for a cube. OLAP can be used to analyze data mining results at different levels of granularity.

In future posts I will describe how to mine OLAP data and to analyze data mining results using OLAP techniques.

Readings: Business intelligence, Data mining, Oracle analytics

Thursday, January 05, 2006 

Analytics in the Oracle Database

The addition of analytics to databases is a natural direction. As the volume of data increases, data movement dominates the cost of computation. It starts to make more sense to move the computation and algorithms to the database than to move the data to an external server for analysis. Furthermore, in most cases, the results of the analysis need to be persisted back to the database and combined with other data in order to make them actionable.

Picking up on this, over the past releases Oracle has continuously added analytic features to its database. Taken as a whole these features make the Oracle Database a powerful platform for developing applications leveraging analytics. However, most users are not aware of the complete set of analytic features available in the database. The following list covers the features present in the Oracle Database 10g Release 2:

  • Complex data transformations
  • Data mining
  • Image feature extraction
  • Linear algebra
  • OLAP
  • Predictive analytics
  • Spatial analytics
  • Statistical functions
  • Text mining
As these features are part of a common server it is possible to combine them efficiently and with ease. The overall benefit is greater than just the sum of the parts that could be achieved through the integration of different servers and tools. For example, it is possible to create efficient arbitrarily complex SQL statements that combine data mining and text processing:
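  -- Valuable customers likely to attrite who recently called about Checking Plus
  SELECT A.cust_name, A.contact_info
    FROM customers A
   WHERE PREDICTION_PROBABILITY(tree_model, 'attrite' USING A.*) > 0.8
     AND A.cust_value > 90
     AND A.cust_id IN
         (SELECT B.cust_id
            FROM call_center B
           -- explicit TO_DATE avoids depending on the session date format
           WHERE B.call_date BETWEEN TO_DATE('01-Jan-2005', 'DD-Mon-YYYY')
                                 AND TO_DATE('30-Jun-2005', 'DD-Mon-YYYY')
             AND CONTAINS(B.notes, 'Checking Plus', 1) > 0);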
The above query selects all customers who have a high propensity to attrite (> 80% chance), are valuable customers (customer value rating > 90), and have had a recent conversation with customer services regarding a Checking Plus account. The propensity to attrite information is computed using a data mining model (tree_model). The query uses Oracle Text's CONTAINS operator to search call center notes for references to Checking Plus accounts.

Finally, it is also easy to integrate the results from queries like the one above with Business Intelligence tools such as Oracle Discoverer, Oracle Portal, and Crystal Reports (more on that in future posts).

In future posts I will cover:
  • How to get the most out of many of these features
  • How to solve real problems using analytics
  • The role of analytics in Business Intelligence and databases
In the meantime, the following provides a brief description of each one of the above features with links for further information.

Complex Data Transformations
Data transformation is a key aspect of analytical applications and ETL (extract, transform, and load). Besides support for transformations through SQL expressions, the Oracle Database, since the Oracle Database 10g Release 1, ships with a flexible data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities.

Data Mining
Oracle Data Mining (ODM), an option to the Enterprise Edition of the Oracle Database, provides a rich set of data mining functionality. ODM 10g Release 2 has eleven algorithms that can be used for classification, regression, clustering, anomaly detection, feature extraction, association analysis, and attribute ranking.

The database also includes in both Standard Edition and Enterprise Edition the frequent itemset package (DBMS_FREQUENT_ITEMSET). This package enables frequent itemset counting and it is used as a building block for ODM's Association algorithm. Frequent itemsets provide a mechanism for counting how often multiple events occur together. This blog post has a nice discussion of this feature.
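To show what frequent itemset counting means, here is a small brute-force sketch in Python. It is only an illustration of the idea and says nothing about how DBMS_FREQUENT_ITEMSET is implemented or called. The function name, its parameters, and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items and keep those that
    occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))          # ignore duplicates within a basket
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical market baskets: each list is one transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]

frequent = frequent_itemsets(baskets, min_support=3)
print(frequent)
```

Here bread and milk appear together in three of the four baskets, so the pair survives the support threshold while the pairs involving eggs do not.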

Image Feature Extraction
Oracle interMedia is a feature of the Oracle Database that is included in both Standard Edition and Enterprise Edition. interMedia supports the extraction of image features (e.g., color histogram, texture, and positional color) that can then be used to characterize and analyze images.

Linear Algebra
Oracle Database 10g Release 2 ships with a new package UTL_NLA. The UTL_NLA package exposes a subset of the popular BLAS and LAPACK (Version 3.0) libraries for operations on vectors and matrices represented as VARRAYs. This package includes procedures to solve systems of linear equations, invert matrices, and compute eigenvalues and eigenvectors.

Predictive Analytics
Data mining can uncover useful information buried in vast amounts of data. However, it is often the case that many users that could benefit from these results do not have any data mining expertise. The DBMS_PREDICTIVE_ANALYTICS package addresses this by automating the entire data mining process from data preprocessing through model building to scoring new data. This package provides an important tool that makes data mining possible for a wider audience of users, in particular, business analysts. The capabilities of this package are also exposed through the Oracle Spreadsheet Add-In for Predictive Analytics. The Oracle Spreadsheet Add-In for Predictive Analytics enables Microsoft Excel users to mine their Oracle Database or Excel data using simple, "one click" Predict and Explain predictive analytics features.

Statistical Functions
The Oracle Database provides a long list of SQL statistical functions with support for hypothesis testing (e.g., t-test, F-test), correlation computation (e.g., Pearson correlation), cross-tab statistics, and descriptive statistics (e.g., median and mode). The package DBMS_STAT_FUNCS adds distribution fitting procedures and a summary procedure that returns descriptive statistics for a column.

OLAP
Oracle OLAP, an option to Oracle Database 10g Enterprise Edition, has features previously found only in specialized OLAP databases. Moving beyond drill-downs and roll-ups, Oracle OLAP also supports time-series modeling and forecast.

Text Mining
Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, in files, and on the web. It also supports automatic classification and clustering of document collections. Many of these analytical features are layered on top of ODM functionality.

Spatial Analytics
Oracle Spatial is an option for Oracle Enterprise Edition that provides advanced spatial features to support high-end GIS and LBS solutions. Oracle Spatial's analysis and mining capabilities include functions for binning, detection of regional patterns, spatial correlation, colocation mining, and spatial clustering. Oracle Spatial also includes support for topology and network data models and analytics. The topology data model of Oracle Spatial allows one to work with data about nodes, edges, and faces in a topology. The network data model includes network analysis functions for computing the shortest path, the minimum cost spanning tree, nearest neighbors, and solutions to the traveling salesman problem, among others.

Readings: Business intelligence, Data mining, Oracle analytics

Sunday, January 01, 2006 

Real Time Scoring & Model Management Series

Categories



Time Series Forecasting Series

Welcome!!

Welcome post - Blog launch.

All Posts

About me

  • Marcos M. Campos: Development Manager for Oracle Data Mining Technologies. Previously Senior Scientist with Thinking Machines. Over the years I have been working on transforming databases into easy to use analytical servers.

Disclaimer

  • Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
  • This work is licensed under a Creative Commons license.