
Tuesday, January 10, 2006 

What Is Data Mining?

I have seen this question asked many times. This question has also created a popular thread in the Oracle Data Mining forum. In this post I'll discuss what data mining is and is not. I will also try to contrast data mining with other activities such as OLAP and Statistics.

Data mining has been a buzzword for some time now. The term has been used and misused in many different contexts. Some definitions of data mining include:

"Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics, machine learning and pattern recognition" [1].

"The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [2].

"The science of extracting useful information from large data sets or databases" [3].
The above definitions highlight some key elements associated with the data mining activity:
  • Automatic discovery of patterns
  • Discovery of patterns that are not easy to detect (non-trivial)
  • Creation of actionable information
  • Focus on large data sets and databases
In summary, data mining is about finding patterns in data that are not easily spotted. Query and reporting have often been incorrectly called data mining. True to its nature, however, data mining implies building models, hence the notion of automatic discovery. Data mining models create abstractions (actionable information) of the data that can be used to answer questions that one could not ask of the data directly with a query alone.
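
To make the contrast between a query and a model concrete, here is a minimal Python sketch. The customer records, the function names, and the single-threshold rule are all assumptions made up for illustration; real data mining algorithms are far richer. The point is only that the query summarizes rows already stored, while the model discovers a rule from the data that can then be applied to a case that is not in the data.

```python
# Toy customer records: (age, bought_product). Values are invented for illustration.
customers = [
    (22, False), (25, False), (31, False), (35, True),
    (41, True), (47, True), (52, True), (29, False),
]

def count_buyers_over(records, min_age):
    """Query and reporting: summarize rows that are already stored."""
    return sum(1 for age, bought in records if age > min_age and bought)

def learn_age_threshold(records):
    """Model building: search for the age cutoff that best separates buyers.

    The rule "predict a purchase when age >= cutoff" is the model. The cutoff
    is discovered from the data rather than supplied by the analyst.
    """
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({age for age, _ in records}):
        # Count how many stored records the candidate rule classifies correctly.
        correct = sum(1 for age, bought in records if (age >= cutoff) == bought)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def predict(cutoff, age):
    """Apply the learned rule to a new case that is not in the data."""
    return age >= cutoff

buyers_over_40 = count_buyers_over(customers, 40)  # a question about stored rows
cutoff = learn_age_threshold(customers)            # a pattern found automatically
print(buyers_over_40, cutoff, predict(cutoff, 38))
```

The query needs the analyst to supply the age of 40 up front. The model finds its own cutoff and can score a 38-year-old prospect, a question the stored rows cannot answer directly.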

Data Mining and Statistics
There is a great deal of overlap between data mining and Statistics. In fact most of the techniques used in data mining can be placed in a statistical framework. However, data mining techniques are not usually counted among the traditional statistical techniques. This can lead to the impression that data mining and statistics are competing disciplines. Traditional statistical methods, in general, require a great deal of user interaction in order to validate the correctness of the model, and thus they are harder to automate. These methods also do not usually scale well to very large datasets. On the other hand, data mining methods are suitable for large datasets and can be more readily automated. In fact, data mining algorithms, in many cases, require large data sets for the creation of quality models.

Data Mining and OLAP
On-Line Analytical Processing (OLAP) can be defined as fast analysis of shared multidimensional data [4]. OLAP and data mining are different but complementary activities. OLAP analysis may include time series analysis, cost allocations, currency translation, goal seeking, ad hoc multidimensional structural changes, non-procedural modeling, exception alerting, and data mining. However, most OLAP systems do not have inductive inference (or data mining) capabilities beyond support for time-series forecasting.

Multidimensional data is a key concept in OLAP. OLAP systems provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies. This view of the data is a natural way to analyze businesses and organizations. Data mining systems (and models), on the other hand, usually do not have a concept of dimensions and hierarchies.

Data mining and OLAP can be integrated in a number of ways. For example, data mining can be used to select the dimensions for a cube, create new attributes for a dimension, and create new measures for a cube. OLAP can be used to analyze data mining results at different levels of granularity.
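
As a rough illustration of one of these integrations, the Python sketch below uses a tiny clustering step to create a new attribute for a customer dimension and then rolls up a sales measure by that attribute. Everything here is hypothetical: the customer ids, spend figures, fact rows, segment names, and the `two_means` helper are invented for the example, and the two-cluster routine stands in for whatever clustering algorithm a real system would use.

```python
from collections import defaultdict

# Hypothetical customer dimension: customer id -> annual spend (invented values).
annual_spend = {"c1": 120, "c2": 150, "c3": 135, "c4": 900, "c5": 950, "c6": 1020}

# Hypothetical fact rows: (customer id, quarter, sales amount).
sales = [
    ("c1", "Q1", 30), ("c2", "Q1", 40), ("c4", "Q1", 250),
    ("c3", "Q2", 35), ("c5", "Q2", 240), ("c6", "Q2", 260),
]

def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2; returns the two cluster centers (low, high).

    Assumes the input contains at least two distinct values.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high

# Data mining step: derive a new dimension attribute (spend segment) from the data.
low, high = two_means(list(annual_spend.values()))
segment = {
    cust: "low_spend" if abs(v - low) <= abs(v - high) else "high_spend"
    for cust, v in annual_spend.items()
}

# OLAP-style step: roll up the sales measure by the mined attribute and quarter.
cube = defaultdict(int)
for cust, quarter, amount in sales:
    cube[(segment[cust], quarter)] += amount

for key in sorted(cube):
    print(key, cube[key])
```

The segment labels were not supplied by anyone; they come out of the clustering step and then behave like any other dimension attribute in the roll-up.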

In future posts I will describe how to mine OLAP data and to analyze data mining results using OLAP techniques.

Readings: Business intelligence, Data mining, Oracle analytics

Marcos
How would you differentiate between Predictive Analytics and Data Mining? Would you do so at all? I'd also love to hear how you see data mining being integrated into the decision service concept in BPEL.
JT

JT
Predictive Analytics has been used in a number of different contexts. In many cases it is just another name used for data mining. I kind of favor the use of Predictive Analytics to mean packaged analytics either within an application or as a high-level component. In either case the key aspect is that it can be used by people with little or no knowledge of advanced analytics. This is the view we have taken when creating the DBMS_PREDICTIVE_ANALYTICS package and Excel add-in. These products encapsulate whole methodologies and do not require any specialized analytical know-how. Just point to a data source and press go.

In summary, I would use Data Mining for the set of low-level techniques and methodologies that enable analytical applications and Predictive Analytics for a set of high-level applications or components that use Data Mining.

Regarding BPEL, I think that Data Mining (DM) can play a very useful role in a decision service. The range of applications is quite wide: data validation and enrichment, differentiated processing based on the detection of anomalous behavior, credit scoring, etc.

Unless we are writing applications against a single DM vendor, it is useful to have well-defined Web services APIs for DM in order to make it easier to leverage DM in BPEL. Currently the two main standards are JDMWS (Java) and XMLA. Both standards cover the whole array of Data Mining functionality. I personally think that, for Web services, we would be more interested in scoring than in model building. In that case these standards are overkill.

JT
Just a quick note to complement the previous comment. If we are writing a dedicated application then the discussion on Web services APIs and standards is moot. For example, using Oracle BPEL we can leverage PL/SQL procedures (including Data Mining ones) easily.

About me

  • Marcos M. Campos: Development Manager for Oracle Data Mining Technologies. Previously Senior Scientist with Thinking Machines. Over the years I have been working on transforming databases into easy to use analytical servers.

Disclaimer

  • Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
  • This work is licensed under a Creative Commons license.