SMOTE
This node oversamples the input data (i.e. adds artificial rows) to enrich the
training data. The applied technique is called
SMOTE (Synthetic Minority Over-sampling Technique) by
Chawla et al.
Some supervised learning algorithms (such as decision trees and neural nets)
work best with a roughly equal class distribution and may otherwise generalize
poorly, i.e. give poor classification performance. If the input data is
unbalanced, for instance if there are only a few objects of the "active" class
but many of the "inactive" class, this node adjusts the class distribution
by adding artificial rows (in the example, rows for the "active" class).
The algorithm works roughly as follows: it creates synthetic rows by interpolating
between a real object of a given class (in the above example "active")
and one of its nearest neighbors of the same class. It picks a random point
on the line segment between these two objects and uses the coordinates of that
point as the attributes (cell values) of the new object.
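The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the node's actual implementation: the function name `smote_samples` and its parameters are hypothetical, all attributes are assumed to be numeric, and Euclidean distance is assumed for the neighbor search.

```python
import math
import random


def smote_samples(rows, k, n_new, seed=None):
    """Create n_new synthetic rows from numeric rows of ONE class (sketch)."""
    if len(rows) < 2:
        raise ValueError("need at least two rows of the class")
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    synthetic = []
    for _ in range(n_new):
        # Pick a real object of the class.
        base = rng.choice(rows)
        # Rank the remaining rows of the same class by distance to it
        # (Euclidean distance is an assumption of this sketch).
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: math.dist(base, r))
        # Randomly select one of the k nearest neighbors.
        neighbor = rng.choice(others[:k])
        # Pick a random point on the line segment between the two objects
        # and use its coordinates as the new row.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


active = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote_samples(active, k=2, n_new=5, seed=42)
```

Because every synthetic row lies between two real rows of the same class, its values never leave the range spanned by that class. Passing the same `seed` twice returns the same rows.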
Dialog Options
- Class Column
-
Pick the column that contains the class information.
- Nearest neighbor
-
Determines how many nearest neighbors are considered.
The algorithm picks an object from the target class, randomly selects
one of its nearest neighbors and places the new synthetic example on the
line between the object and that neighbor.
- Oversample by
-
Select this option to oversample each class by the same factor. The
value specifies how much synthetic data is added relative to the rows
already present in each class, e.g. a value of 2 adds twice as many
synthetic rows as the class already has (if the input table contains
50 rows labeled "A", the output will contain 150 rows belonging to "A").
- Oversample minority classes
-
This option adds synthetic examples to all classes that are
not the majority class. The output contains the same number
of rows for each of the possible classes.
- Enable static seed
-
Check this option if you want to use a seed for the random number
generator. This will cause consecutive runs of the node to produce
the same output data. If unchecked, each run of the node generates
a new seed. Use "Draw new seed" to randomly draw a new seed.
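The row counts implied by the two oversampling options can be checked with a small calculation. The helper below is a hypothetical sketch: the name `synthetic_rows_needed` is made up for illustration, and truncating fractional results to whole rows is an assumption, not documented node behavior.

```python
from collections import Counter


def synthetic_rows_needed(labels, oversample_by=None):
    """Return the number of synthetic rows to add per class (sketch)."""
    counts = Counter(labels)
    if oversample_by is not None:
        # "Oversample by": every class grows by the same factor.
        # Truncation to an integer is an assumption of this sketch.
        return {c: int(n * oversample_by) for c, n in counts.items()}
    # "Oversample minority classes": fill every class up to the
    # size of the majority class.
    majority = max(counts.values())
    return {c: majority - n for c, n in counts.items()}


labels = ["A"] * 50 + ["B"] * 200
by_factor = synthetic_rows_needed(labels, oversample_by=2)
balanced = synthetic_rows_needed(labels)
```

With 50 rows of "A" and a factor of 2, `by_factor["A"]` is 100, giving the 150 rows from the example above. In the balancing mode, "A" receives 150 synthetic rows so that both classes end up with 200 rows.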
Ports
Input Ports
0: Table containing labeled data for oversampling.
Output Ports
0: Oversampled data (input table with appended rows).
This node is contained in KNIME Base Nodes
provided by KNIME GmbH, Konstanz, Germany.