KDD 2006 - Day One
KDD concentrates most of the tutorials and workshops on the first day. In previous years I usually jumped from room to room trying to catch interesting talks. This year I decided to follow a different strategy: I picked a full-day workshop and stuck with it for the day. I chose the Data Mining for Business Applications Workshop organized by Rayid Ghani (Accenture Technology Labs) and Carlos Soares (University of Porto). It seems that I picked well. The meeting, in one of the larger rooms, was quite full, and the audience was very participative. I also got to hook up with the rest of the Oracle team and got an early look at our presence in the exhibit hall. Pretty cool stuff; more on that in the next post.
There were many interesting talks and discussions; I will summarize the high points for me. The first talk I saw, A Boosting Approach to Automated Trading (Creamer & Freund), described an automated trading approach based on machine learning. The results were very good. One question from the audience captured something that always comes to mind when I see a talk like that: if this system is so good, why is someone talking about it instead of making money with it? The answer came at several levels. First, there is the desire to make one's research public. Second, the presented system was a simplified one; a real system would have higher scalability requirements. But the key issue was a matter of trust: would you risk your money on an automated machine learning approach? Trust was a theme that recurred throughout the workshop.
The next talk, A Decision Management Approach to Basel II Compliant Credit Risk Management (van de Putten, et al.), stressed the need to combine machine learning/data mining approaches with rule-based ones that codify knowledge from the user. The combination of data mining and rules is essential for success. A key role played by data mining in this context is the estimation of the probability of default on a loan, which can have a very big impact on how much money banks are required to set aside under the Basel II rules. This seems to be an interesting problem for trying a combination of Oracle technologies, namely Oracle Data Mining and Oracle's rules engine technologies. The speaker proposed many areas for future research. When I asked him where to find real data for meaningful research, he acknowledged that there are no publicly available data sets. I see this as one of the key factors preventing the development of applications using data mining. Without meaningful and realistic data sets publicly available, research progresses very slowly. Only the small number of researchers who have access to proprietary data can make contributions; the data mining research community at large is left untapped as a resource for solving pressing problems. Furthermore, if realistic data were available, it would be easier to develop prototypes that showcase the value of the technology to end users in a compelling way. This could have a very positive impact on the adoption of analytics in applications. The lack of available data was another recurrent topic in the workshop.
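To make the idea concrete, here is a minimal sketch of what combining a mined probability-of-default estimate with a codified business rule could look like. This is not the speakers' system: the data is synthetic, the feature names, the `required_reserve` function, and the reserve calculation are all my own illustrative inventions, and a real Basel II capital computation is far more involved.

```python
# Illustrative sketch only: a data-mined probability-of-default (PD)
# estimate combined with a hand-written business rule. Synthetic data;
# the features and reserve formula are hypothetical, not Basel II math.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Hypothetical loan features: [debt-to-income ratio, years of credit history]
X = rng.uniform([0.0, 0.0], [1.0, 30.0], size=(500, 2))
# Synthetic labels: default more likely at high debt-to-income
y = (rng.random(500) < X[:, 0] * 0.5).astype(int)

# Data mining side: a model that estimates the probability of default
model = LogisticRegression().fit(X, y)

def required_reserve(features, exposure):
    """Blend the model's PD estimate with a codified expert rule."""
    pd_est = model.predict_proba([features])[0, 1]
    if features[0] > 0.8:          # rule: very high debt-to-income ratio
        pd_est = max(pd_est, 0.5)  # floor the PD regardless of the model
    return exposure * pd_est       # simplistic reserve proportional to PD

print(required_reserve([0.9, 2.0], 100_000.0))
```

The point of the combination is visible in `required_reserve`: the statistical model supplies the estimate, while the rule encodes user knowledge the model may not capture and acts as a guardrail.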
Another interesting talk was Discovering Telecom Fraud Situations Through Mining Anomalous Behavior Patterns (Alves, et al.). The authors used clustering as the modeling technique. I wonder if better results could be obtained using a technique like the one-class support vector machines included in Oracle Data Mining. Again, the data are not publicly available.
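For readers unfamiliar with the technique, here is a minimal sketch of one-class SVM anomaly detection on synthetic data (using scikit-learn rather than Oracle Data Mining, and invented call-record features, since the real data set is not public). The model is trained only on "normal" behavior and then flags records that fall outside it.

```python
# Hypothetical sketch of one-class SVM anomaly detection.
# Synthetic stand-in for call records: [call duration, calls per day].
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# "Normal" subscriber behavior clustered around typical values
normal = rng.normal(loc=[5.0, 20.0], scale=[1.0, 4.0], size=(200, 2))
# Records far from the bulk, such as a fraud situation might produce
suspicious = np.array([[40.0, 300.0], [0.1, 500.0]])

# Train on normal behavior only; nu bounds the fraction treated as outliers
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)

pred = model.predict(suspicious)  # -1 = anomaly, +1 = normal
print(pred)
```

Unlike clustering, which groups all the data and leaves the analyst to decide which clusters look fraudulent, the one-class formulation directly learns a boundary around normal behavior, which is one reason it seemed worth wondering about here.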
The talk Interactivity Closes the Gap: Lessons Learned in an Automotive Industry Application (Blumenstock, et al.) discussed a human-in-the-loop approach. Users interact with a data mining tool to gradually build a solution that they trust. During the discussion the issue of accuracy vs. transparency came up. Practitioners feel that in the earlier stages of a project transparency considerations dominate (the trust issue again). In later stages, as users start to trust the technology, accuracy becomes more important even if transparency is lost.
The panel discussions were very lively. The first panel addressed Bridging the Gap Between Data Mining Research and Practical Business Applications (Ronny Kohavi, Karl Rexer, and Galit Shmueli). Kohavi contrasted his experiences at Amazon and Microsoft, two companies with very different styles of research. Research at Amazon is very focused and secretive; research at Microsoft has a broader scope and is very open. Amazon's approach makes it easier to turn research into products; Microsoft's approach is better at recruiting talented researchers. Shmueli proposed that MBA students are good candidates for data mining recruitment. She has also observed an increase in the number of MBAs taking courses in analytics. Rexer commented that he works with many large companies that do not have a dedicated analytics group. He proposed that we need to focus training on how to use data mining in business instead of on data mining per se. I agree with that. There is too much focus on techniques and very little discussion of business uses.
The second panel discussion, Deploying Data Mining Solutions: Stories, Challenges, and Open Issues (Tyler Kohn, Ramin Mikaili, Richard Boire, and Françoise Fogelman), covered a number of interesting use cases. Mikaili presented a framework used by Accenture that has had good success on some challenging problems. Boire showed how a series of real-life business concerns were addressed with data mining. He also highlighted how easy it is to obtain incorrect results when one is not careful in applying data mining techniques. Fogelman presented the most controversial talk. She articulated the need to automate data mining in order to empower a larger number of users and eliminate the analytical bottleneck. I liked her talk quite a bit, as I am a strong believer that automation of data mining is not only necessary but unavoidable. However, automating data mining is a hard sell to an audience of data mining experts, and the topic generated a very animated discussion. The usual reaction to automation is to list a number of situations where it would be very hard to do. But that misses the point. There will always be problems that need experts. Automation allows users to solve the simpler problems by themselves and frees the experts to solve the hard ones. As a result, the overall utilization of analytics in a company increases. This also makes the value of the technology more tangible to end users. As users better understand the value of data mining, the number of jobs available to experts should also increase (remember that there are always tough problems to solve).