Discretise Node
The wizard-driven discretise node allows you to group together similar values within continuous numeric fields. For example, you could use discretisation to group temperature values into three partitions (or bins) representing low, medium and high. The continuous values within the field would be replaced by the discrete values 0, 1 and 2 respectively, so the field retains its original ordering. The type of the field would change from continuous numeric to discrete numeric.
The approaches to discretisation can be subdivided into two categories:
Automatic discretisation mode: three methods are provided to partition the values from one or more fields into a number of bins. Two of the methods form equal-width bins or bins with an (approximately) equal frequency of records. Alternatively, an optimal algorithm can be used to partition the values based on a specified distance metric; however, the complexity of this algorithm means that it is typically suitable only for fields with fewer than 1000 records. Once the automatic discretisation has been performed, the boundaries of each partition are reported.
Manual discretisation mode: you can use this mode to partition the data with user-defined boundaries (for example, age bands: up to 20, 20-40, 40-60, 60+). The manual mode is also useful when the automatic mode has been used to discretise the training set and the same discretisation is required on a testing set (applying the algorithm directly to the testing set would yield a different set of boundaries). The boundary information reported from the training set can be used to recreate the required partitions on the testing set.
After discretisation, the values in the fields will consist of contiguous integer values which can be converted to categorical data for mining purposes.
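To make the difference between the two simpler automatic methods concrete, the following is a minimal Python sketch of how equal-width and equal-frequency split points could be computed. It is an illustration only, not the node's own implementation: the function names, the sample temperatures and the way ties are handled are all assumptions made for this example.

```python
from typing import List, Sequence


def equal_width_splits(values: Sequence[float], n_bins: int) -> List[float]:
    """Return n_bins - 1 split points dividing [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def equal_frequency_splits(values: Sequence[float], n_bins: int) -> List[float]:
    """Return split points so that each bin holds roughly the same number of records."""
    ordered = sorted(values)
    n = len(ordered)
    # Take the value found at each i/n_bins position of the sorted data.
    # Repeated values can make the resulting bins only approximately equal.
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


# Hypothetical temperature readings with one unusually high value.
temperatures = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 30.0]

width_splits = equal_width_splits(temperatures, 3)
freq_splits = equal_frequency_splits(temperatures, 3)
# freq_splits is [8.0, 14.0], giving three records per bin
```

With these sample values the equal-width boundaries are pulled upwards by the single high reading, so most records land in the lowest bin, whereas the equal-frequency boundaries place three records in each bin.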
The discretise node is an intermediate node (that is, it appears in the middle of a stream). When the discretisation algorithms have finished executing, the calculated split points which form the boundaries of the bins within each field are displayed in the log window. When working on training and testing datasets, the split points from the training set should be recorded so that an identical discretisation can be performed on the testing set using manual discretisation.
An example of the split points results could be:
0 = 8.3), 1 = [8.3 - 12.9), 2 = [12.9
which represents the following discrete bins.
Bin ID | Range of Values
0 | < 8.3
1 | >= 8.3 and < 12.9
2 | >= 12.9
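As a rough illustration of what manual discretisation does with recorded split points, the Python sketch below maps values to the bin IDs listed above. It assumes the split points 8.3 and 12.9 were noted from a training run; the function name and the testing values are hypothetical, and the code is not taken from the product.

```python
from bisect import bisect_right
from typing import List, Sequence


def apply_splits(values: Sequence[float], splits: Sequence[float]) -> List[int]:
    """Map each value to a bin ID using half-open bins [lower, upper)."""
    # bisect_right counts how many split points are <= the value, so a value
    # equal to a split point is placed in the higher bin.
    # The split points must be sorted in ascending order.
    return [bisect_right(splits, v) for v in values]


training_splits = [8.3, 12.9]  # split points recorded from the training set
testing_values = [5.0, 8.3, 10.1, 12.9, 20.0]  # hypothetical testing data

bin_ids = apply_splits(testing_values, training_splits)
# bin_ids is [0, 1, 1, 2, 2]
```

Reusing the recorded split points in this way gives the testing set exactly the same bins as the training set, which would not be guaranteed if the boundaries were recalculated from the testing data.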
The options for discretisation are set by using the discretisation wizard. Full details can be found starting on the discretisation - general options page.