KDD Methodology

 

KDD stands for Knowledge Discovery in Databases. It is the process of extracting previously unknown and potentially useful information from large sets of data.

 

The general capabilities of the system can be summarised by the stages of the KDD Roadmap [1]. The outline of this process is illustrated below.

KDD Roadmap Outline

 

PROBLEM SPECIFICATION

Much of the work required during this stage relates to setting up the data mining project and establishing what information is available and what must be obtained. The stage introduces the concept of a log, which records the decisions made at each stage of the KDD process; this log is implemented in the toolkit to describe the options and results of applying each node in the stream. Tools to generate summary statistics and visualise the database are available to aid familiarisation.

RESOURCING

Resourcing is the stage that gathers all information, materials, equipment and personnel required to carry out the KDD project.

DATA CLEANSING

The package provides a number of tools to help cleanse the format of the database. These are processes that will normally be applied only once to the database, unlike the pre-processing stages, where parameter setting becomes important and the process is iterative. Cleansing techniques include record sampling (random, 1 in n) and the handling of missing data.
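
The two kinds of cleansing named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the WITNESS Miner implementation: the function names (sample_one_in_n, sample_random, drop_incomplete, fill_missing), the use of a list of dictionaries as the "database", and the use of None as the missing-value marker are all assumptions made for the example.

```python
import random


def sample_one_in_n(records, n, start=0):
    """Systematic sample: keep every n-th record, beginning at index `start`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return records[start::n]


def sample_random(records, fraction, seed=None):
    """Random sample, without replacement, of about `fraction` of the records."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    k = round(len(records) * fraction)
    return rng.sample(records, k)


def drop_incomplete(records, missing=None):
    """Remove every record that holds the missing-value marker in any field."""
    return [r for r in records if all(v != missing for v in r.values())]


def fill_missing(records, field, default, missing=None):
    """Return copies of the records with missing values in `field` replaced."""
    return [
        {**r, field: default} if r.get(field) == missing else dict(r)
        for r in records
    ]


if __name__ == "__main__":
    # Hypothetical database: every fourth record has no recorded age.
    db = [{"id": i, "age": None if i % 4 == 0 else 20 + i} for i in range(12)]
    print(len(sample_one_in_n(db, 3)), "records kept by 1-in-3 sampling")
    print(len(drop_incomplete(db)), "records have no missing values")
```

Whether incomplete records are dropped or filled with a default is a decision for the analyst; both functions return new lists, so the original database is left unchanged.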

PRE-PROCESSING

Pre-processing techniques are very powerful functions that can dramatically change the shape and format of the database. They fall into two main areas: feature selection, which reduces the data by keeping only the most strongly predictive fields (field sampling); and discretisation, which converts continuous numeric data to discrete numeric data (one form of value sampling).
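
As a rough illustration of the two areas, the Python sketch below discretises a numeric field into equal-width bins and ranks fields by the strength of their linear correlation with a numeric class label. The text does not say which discretisation or feature selection methods the package uses, so equal-width binning and correlation ranking are assumptions chosen for simplicity, and the names equal_width_bins, pearson and select_fields are hypothetical.

```python
import math


def equal_width_bins(values, n_bins):
    """Discretise continuous values into bin indices 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a constant field falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_fields(columns, target, k):
    """Keep the k fields whose values correlate most strongly with the target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4))
    fields = {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1], "c": [4, 1, 3, 2]}
    print(select_fields(fields, target=[0, 0, 1, 1], k=1))
```

Both functions take parameters (the number of bins, the number of fields to keep) whose best values are rarely known in advance, which is why pre-processing tends to be iterative.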

DATA MINING

The techniques that WITNESS Miner uses are suitable for most types of data mining problem (for example, classification and clustering). It can be particularly powerful, however, to target problems where rules are required to predict the records belonging to a particular class. WITNESS Miner uses a simulated annealing data mining engine to generate rules describing these classes of interest.
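
The general idea of searching for a rule by simulated annealing can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine used by WITNESS Miner: a rule is taken to be a conjunction of "field = value" conditions, its fitness is assumed to be the number of correctly matched records minus the number of wrongly matched ones, the cooling schedule is geometric, and the names matches, fitness, neighbour and anneal_rule are hypothetical.

```python
import math
import random


def matches(rule, record):
    """A rule (field -> required value) matches if every condition holds."""
    return all(record[f] == v for f, v in rule.items())


def fitness(rule, records, target_field, target_class):
    """Assumed fitness: matched records in the class minus those outside it."""
    tp = fp = 0
    for r in records:
        if matches(rule, r):
            if r[target_field] == target_class:
                tp += 1
            else:
                fp += 1
    return tp - fp


def neighbour(rule, domains, rng):
    """Make a small random change: drop, add or alter one condition."""
    new = dict(rule)
    field = rng.choice(sorted(domains))
    if field in new and rng.random() < 0.5:
        del new[field]
    else:
        new[field] = rng.choice(domains[field])
    return new


def anneal_rule(records, target_field, target_class,
                steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Search for a high-fitness rule; returns (best_rule, best_fitness)."""
    rng = random.Random(seed)
    domains = {f: sorted({r[f] for r in records})
               for f in records[0] if f != target_field}
    current = {}  # the empty rule matches every record
    cur_fit = fitness(current, records, target_field, target_class)
    best, best_fit = dict(current), cur_fit
    t = t0
    for _ in range(steps):
        cand = neighbour(current, domains, rng)
        cand_fit = fitness(cand, records, target_field, target_class)
        delta = cand_fit - cur_fit
        # Always accept improvements; accept a worse rule with a probability
        # that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, cur_fit = cand, cand_fit
            if cur_fit > best_fit:
                best, best_fit = dict(current), cur_fit
        t *= cooling
    return best, best_fit


if __name__ == "__main__":
    # Hypothetical data: faults occur mostly on machine A during the night shift.
    data = [
        {"machine": "A", "shift": "night", "fault": "yes"},
        {"machine": "A", "shift": "night", "fault": "yes"},
        {"machine": "A", "shift": "day", "fault": "no"},
        {"machine": "B", "shift": "night", "fault": "no"},
        {"machine": "B", "shift": "day", "fault": "no"},
    ]
    print(anneal_rule(data, "fault", "yes"))
```

Accepting an occasional worse rule early in the search is what lets annealing escape a locally good but globally poor rule; as the temperature falls, the search behaves more and more like simple hill climbing.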

EVALUATION

Processes to build and evaluate rules are included as part of a collection of evaluation tools. The rules can be evaluated against new databases for testing purposes, and the components of a rule can be 'tweaked' to determine its sensitivity and interest. Visualisation methods such as scatter plots and histograms are also available.
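
Evaluating a rule against a new database, and 'tweaking' one of its components, might look like the Python sketch below. The measures shown (accuracy as the proportion of matched records that belong to the class, coverage as the proportion of the class that the rule matches) are common choices assumed for this example rather than the measures the package necessarily reports, and the names matches and evaluate_rule are hypothetical.

```python
def matches(rule, record):
    """A rule (field -> required value) matches if every condition holds."""
    return all(record[f] == v for f, v in rule.items())


def evaluate_rule(rule, records, target_field, target_class):
    """Count matches on a database and derive accuracy and coverage."""
    matched = [r for r in records if matches(rule, r)]
    correct = sum(1 for r in matched if r[target_field] == target_class)
    in_class = sum(1 for r in records if r[target_field] == target_class)
    return {
        "matched": len(matched),
        "correct": correct,
        # Guard against division by zero when nothing matches or the
        # class does not occur in this database.
        "accuracy": correct / len(matched) if matched else 0.0,
        "coverage": correct / in_class if in_class else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical test database, separate from the data used to build the rule.
    test_db = [
        {"machine": "A", "shift": "night", "fault": "yes"},
        {"machine": "A", "shift": "night", "fault": "no"},
        {"machine": "A", "shift": "day", "fault": "yes"},
        {"machine": "B", "shift": "night", "fault": "yes"},
        {"machine": "B", "shift": "day", "fault": "no"},
    ]
    full_rule = {"machine": "A", "shift": "night"}
    tweaked = {"machine": "A"}  # the same rule with one condition removed
    print(evaluate_rule(full_rule, test_db, "fault", "yes"))
    print(evaluate_rule(tweaked, test_db, "fault", "yes"))
```

Comparing the two results shows how sensitive the rule is to a single condition: if removing the condition barely changes accuracy but raises coverage, the simpler rule may be the more interesting one.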


  1. J.C.W. Debuse, B. de la Iglesia, C.M. Howard and V.J. Rayward-Smith.
    "A methodology for Knowledge Discovery: The KDD Roadmap."
    University of East Anglia, School of Information Systems Technical Report SYS-C99-01.