

Input data types

You can read input data from two different kinds of sources:
  • from sd-files, which contain the structures of chemical compounds
  • from plain-text files, which contain comma- or tab-separated values

The source-file browser therefore shows all files with the extensions *.sdf, *.csv and *.txt.
To read input, drag an item from the source-file browser into the pipeline view.

SD-files

To use an sd-file as input, drag it into the pipeline area (a).
You will be shown a preview of the properties within this file (b), of which one or more can be selected as response variables (c). Note that all molecules within the sd-file need to have the same properties.
If all properties that have not been selected as response variables (properties 1, 2 and 3 in the example shown above) are to be used as descriptors, check box (d).
For classification data sets that contain strings or characters as class labels, (e) should be checked.
If (f) is checked, each descriptor is centered and scaled to a mean of zero and a standard deviation of one; (g) does the same for each response variable. While it is nearly always desirable to standardize the descriptors in this way, so that each feature has the same a priori chance to contribute to the response, make sure not to apply it to the response variable of data sets intended for classification.
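
The effect of options (f) and (g) can be illustrated with a minimal Python sketch. It is not the program's own code: the function name standardize is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text above does not say which one is used.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Shift a column to mean 0 and rescale it to standard deviation 1."""
    mean = fmean(column)
    std = pstdev(column)  # assumption: population standard deviation
    if std == 0:
        # A constant column cannot be rescaled; map it to zeros instead.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical descriptor column with mean 5 and standard deviation 2.
descriptor = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(descriptor)
print(scaled)  # values now have mean 0 and standard deviation 1
```

Applied to integer class labels such as 0, 1 and 2, the same transformation generally produces fractional values that no longer read as class identifiers, which is why the response variable of a classification data set should be left untouched.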

60 descriptors will later be calculated automatically for all molecules in this file. While these alone may allow sufficient modelling of easy data sets, it is often very desirable to import a larger set of descriptors. The CSV-files section below describes how to do this.
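
The requirement stated earlier in this section, that every molecule in the sd-file carries the same properties, can be checked before a file is loaded. The following Python sketch is only an illustration and not part of the program; the function names and the sample data are hypothetical. It relies on the usual sd-file layout, in which each property starts with a line of the form "> <NAME>", its value follows on the next lines up to a blank line, and records are separated by "$$$$".

```python
def read_sd_properties(text):
    """Return one dict {property name: value} per molecule record."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the last delimiter
        props = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            # A property header looks like "> <NAME>".
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                name = line[line.index("<") + 1:line.rindex(">")]
                values = []
                i += 1
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i].strip())
                    i += 1
                props[name] = "\n".join(values)
            i += 1
        records.append(props)
    return records


def inconsistent_records(records):
    """Return indices of records whose property names differ from the first."""
    if not records:
        return []
    expected = set(records[0])
    return [i for i, props in enumerate(records) if set(props) != expected]


# Hypothetical two-molecule file; the second record lacks "activity".
SAMPLE = """mol1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logP>
1.2

> <activity>
5.1

$$$$
mol2
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logP>
0.7

$$$$
"""

records = read_sd_properties(SAMPLE)
print(inconsistent_records(records))  # indices of deviating molecules
```

Any index reported by inconsistent_records points at a molecule whose property set would have to be completed or removed before the file can be used as input.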

CSV-files

To use a csv-file as input, drag it into the pipeline area. If you want to use this file to read additional descriptors that have previously been computed by another program, e.g. dragon, drag the csv-file from the source-file browser onto the sd-item (a).
Then specify the number of response variables that the file contains (b) and the separator symbol used in it (c). Note that response variables, if present, need to be located in the last columns of the table.
If the first line of the file contains labels for the descriptors, check (d); if its first column contains the names of the compounds, check (e).
Options (f)-(h) are the same as options (e)-(g) described above for sd-files.
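
The table layout described in this section can be sketched in Python as follows. This is an illustration of the expected file structure, not the program's own reader: the function read_table, its parameters and the sample data are hypothetical, and the assumption that the label line also holds an entry for the name column is not confirmed by the text above.

```python
import csv
import io


def read_table(text, n_responses, separator=",", has_header=True, has_names=True):
    """Split a table into header, compound names, descriptors and responses.

    The response variables are taken from the last n_responses columns.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=separator))
    header = rows.pop(0) if has_header else None
    names, descriptors, responses = [], [], []
    for row in rows:
        if not row:
            continue  # ignore empty lines
        if has_names:
            names.append(row[0])
            row = row[1:]
        values = [float(v) for v in row]
        split = len(values) - n_responses
        descriptors.append(values[:split])
        responses.append(values[split:])
    return header, names, descriptors, responses


# Hypothetical file: label line, compound names, two descriptors, one response.
SAMPLE = """name,d1,d2,activity
aspirin,1.5,0.2,6.1
caffeine,2.0,0.4,5.3
"""

header, names, descriptors, responses = read_table(SAMPLE, n_responses=1)
print(names, descriptors, responses)
```

For a tab-separated file the same call would use separator="\t"; with n_responses=0 every numeric column is treated as a descriptor.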

