Data Source Node 

The data source node loads data into WITNESS Miner. Data can be loaded either from CSV-formatted text files or from OLE DB-compliant data providers; both formats are commonly used by KDD practitioners and by other software packages.

Loading from a text file supports only flat-file data, but it is the fastest method of loading data.

OLE DB support lets the node extract data from any source that can be represented in tabular form and for which a provider has been written (providers are standard for most major DBMSs). OLE DB has two added benefits: it can extract data from non-database applications, such as Excel or an email client, and it can combine data from multiple sources or tables using SQL.

The node is a source node: it is always the first node in a stream, so no links lead into it, but you can link several nodes from it.

 

Text (Flat) File format

Data is loaded from comma-separated values (CSV) files in standard text format.

Field names

Field names are read from the first line of the main data file. When a field contains categorical data (represented by either text or numbers), you can force it to be read as categorical by prefixing the field name with an asterisk (*). See the text file example below.

The maximum length of a field name is 128 characters. WITNESS Miner truncates any characters beyond this limit when it loads the field names. A field name should not contain spaces; if WITNESS Miner encounters any spaces during the loading process, it replaces each one with an underscore character.
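The field name rules above can be sketched in Python to check a header line before loading it. This is only an illustration, not WITNESS Miner's own code: the function name normalize_field_name is hypothetical, and it assumes that the asterisk is removed first, spaces are replaced next, and truncation is applied last, with the asterisk not counted toward the 128 characters. The product may apply these steps differently.

```python
MAX_NAME_LENGTH = 128  # limit stated in the text above


def normalize_field_name(raw):
    """Return (name, is_categorical) for one header entry.

    Assumed order of operations (not confirmed by the documentation):
    strip the asterisk marker, replace spaces, then truncate.
    """
    is_categorical = raw.startswith("*")
    name = raw[1:] if is_categorical else raw
    name = name.replace(" ", "_")  # spaces become underscores
    return name[:MAX_NAME_LENGTH], is_categorical


header = "age,*eye color,height,*sex"
for entry in header.split(","):
    print(normalize_field_name(entry))
```

Running a header through a check like this before loading shows which names would be altered by the loader.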

Field types

WITNESS Miner recognizes three main data types.

Type                 Example field            Example data
Continuous numeric   salary, temperature      0.124, 3.2e03
Discrete numeric     age, number cars         -2, -1, 0, 34591
Categorical          color, greater than 50   red, blue, TRUE, FALSE

Identify categorical data by prefixing the field name with an asterisk (as described above). If WITNESS Miner encounters any spaces during the loading process, it replaces each one with an underscore character. You do not need to enclose text in quotation marks.

As with field names, the maximum length of a categorical string is 128 characters.

Special case and encoding

When dealing with certain field types, you should be aware of the implications of the semantics of the data:

  1. Categorical data encoded as discrete numeric values

    Example: a field contains values such as 1, 2 and 3, which encode 1 = red, 2 = blue and 3 = green. The values 1, 2 and 3 have no meaningful order in terms of what the data represents, so you should place an asterisk character before the field name to indicate that it is categorical data.

  2. Categorical data with an ordering

    When categorical text data contains values such as low, medium and high, these values have an ordering that could be important in the search for rules. Encode them as the numbers 0, 1 and 2 respectively and treat the field as discrete data.
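The second case can be prepared outside WITNESS Miner before the file is loaded. The following is a minimal Python sketch of the recoding; the names ORDER and encode_ordinal are hypothetical, and leaving empty (missing) entries empty is an assumption about how you would want the recoded file to look.

```python
# Mapping from the ordered categories to discrete numeric codes,
# as suggested in the text above.
ORDER = {"low": 0, "medium": 1, "high": 2}


def encode_ordinal(values, order=ORDER):
    """Replace each category with its numeric code.

    Empty strings (missing values) are passed through unchanged.
    An unknown category raises KeyError so that typos are not hidden.
    """
    return ["" if v == "" else order[v] for v in values]


print(encode_ordinal(["medium", "low", "", "high"]))  # [1, 0, '', 2]
```

Because the recoded field is numeric, it should not carry the asterisk prefix in the header.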

Missing values

There is no special notation or constant to represent missing values. When reading data from a file, a missing value is recorded if there are no characters between one delimiter and the next. See the text file example below. Missing values must not appear in the target field.

Target field

The target (or output) field is by default taken to be the last field in the database. If the target field does not appear at the end of the database, it can be set using the reorder node.
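For text files, an alternative to the reorder node is to move the target column to the end of the CSV file before loading it. The sketch below is a preprocessing step outside WITNESS Miner and does not represent the reorder node itself; the function name move_field_to_end is hypothetical, and the target name must match the header entry exactly, including any asterisk prefix.

```python
import csv
import io


def move_field_to_end(csv_text, target):
    """Return the CSV text with the column named `target` moved last."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    idx = rows[0].index(target)  # raises ValueError if the name is absent
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Keep every other column in its original order.
        writer.writerow(row[:idx] + row[idx + 1:] + [row[idx]])
    return out.getvalue()


sample = "*sex,age,height\nM,21,1.70\nF,26,\n"
print(move_field_to_end(sample, "*sex"))
```

Empty fields are written back as empty fields, so missing values in the other columns are preserved.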

Text File Example

The following example illustrates how the data appears in the plain text file and how it is interpreted by the data source node.

Data in CSV file

age,*eye_color,height,*sex
21,blue,1.70,M
26,brown,,F
19,blue,1.68,M
etc.

Data loaded using data source node

Field name   age                  eye_color       height                 sex
             (Discrete numeric)   (Categorical)   (Continuous numeric)   (Categorical)
Record 1     21                   blue            1.70                   M
Record 2     26                   brown           Missing                F
Record 3     19                   blue            1.68                   M
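The interpretation shown above can be imitated in a few lines of Python, which may help when checking a file before loading it. This is a sketch under stated assumptions, not WITNESS Miner's actual loader: the function names are hypothetical, and the rule used to separate discrete from continuous fields (a field is discrete if every non-missing value parses as an integer) is an assumption, since the documentation does not say how the product decides.

```python
import csv
import io

SAMPLE = """age,*eye_color,height,*sex
21,blue,1.70,M
26,brown,,F
19,blue,1.68,M
"""


def is_int(text):
    """True if the text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def load(csv_text):
    """Return (names, types, records) for a CSV string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, raw_records = rows[0], rows[1:]
    names, types = [], []
    for col, raw in enumerate(header):
        present = [r[col] for r in raw_records if r[col] != ""]
        if raw.startswith("*"):  # asterisk forces categorical
            names.append(raw[1:])
            types.append("categorical")
        else:
            names.append(raw)
            # Assumed inference rule: all integers -> discrete.
            discrete = all(is_int(v) for v in present)
            types.append("discrete numeric" if discrete else "continuous numeric")
    convert = {"categorical": str, "discrete numeric": int, "continuous numeric": float}
    records = [
        # An empty field between two delimiters is a missing value.
        [None if v == "" else convert[t](v) for v, t in zip(row, types)]
        for row in raw_records
    ]
    return names, types, records


names, types, records = load(SAMPLE)
for name, kind in zip(names, types):
    print(name, "->", kind)
for record in records:
    print(record)
```

In this sketch a missing value is represented as None, so the second record has None in the height position.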

Databases via OLE DB

OLE DB is a set of interfaces for accessing any data source that can be represented in tabular format and for which a provider has been written. A number of standard formats, including Jet and ODBC, were set up during the installation process. The most efficient method of accessing data sources is through a specially written provider, such as the one provided for the Jet Database Engine in Microsoft Access. If such a provider is not available, the provider for ODBC data sources will allow connections to databases through the standard ODBC method.

OLE DB Example

This example shows how to connect to the customer response sample database (in Microsoft Access 97 format).

For more information on OLE DB see http://www.microsoft.com/data/

Options

Full details of the options available for the data source node can be found on the data source options page.