Logistic Regression Learner
Performs a multinomial logistic regression. In the dialog, select a
target column (combo box at the top), i.e. the response. The two
lists in the center of the dialog let you choose which columns
are used as the (independent) variables.
Make sure the columns you want to use are in the right-hand
"include" list.
See the Wikipedia article on
logistic regression
for an overview of the topic.
This particular implementation uses an iterative optimization procedure
known as Fisher scoring to compute the model.
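To make the idea of Fisher scoring concrete, the following is a minimal sketch for the binary case only, written in plain Python. It is not the node's implementation (which is multinomial); the function names `solve` and `fisher_scoring`, the iteration limit, the tolerance, and the toy data are all assumptions made for illustration. Each iteration builds the score vector and the Fisher information matrix and updates the coefficients by solving information * step = score.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fisher_scoring(rows, labels, max_iter=25, tol=1e-8):
    """Fit a binary logistic regression; returns [intercept, coef_1, ...]."""
    xs = [[1.0] + list(r) for r in rows]  # prepend the intercept term
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        score = [0.0] * k                      # gradient of the log-likelihood
        info = [[0.0] * k for _ in range(k)]   # Fisher information matrix
        for x, y in zip(xs, labels):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                score[i] += (y - p) * x[i]
                for j in range(k):
                    info[i][j] += w * x[i] * x[j]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Toy data (hypothetical): one numeric column, classes overlap, so no separation.
rows = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 1, 0, 1, 1]
beta = fisher_scoring(rows, labels)
print("intercept and coefficient:", beta)
```

The sketch also shows why the data requirements matter: a constant or perfectly correlated column makes the information matrix singular, so the linear solve fails.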
If the optional PMML input port is connected and contains
preprocessing operations in the TransformationDictionary, those are
added to the learned model.
Potential Errors and Error Handling
The model is computed by an iterative optimization process that places some requirements on the data set:
a reasonable distribution of the target values and non-constant, uncorrelated columns. Although
some of these properties are checked during node execution, you may still run into errors during the
computation. The list below describes what might go wrong and how to avoid such situations.
- Insufficient Information: The data does not provide enough information about
one or more target categories. Try to get more data, or remove the rows of the target categories that cause
the error. If you are interested in a model for only one target category, group the target
column beforehand. For instance, if your data contains the target categories "A", "B", ..., "Z" but
you only want a model for class "A", you can use a Rule Engine node to convert your
target into "A" and "not A".
- Violation of Independence: Logistic regression is based on the assumption of statistical independence.
A common preprocessing step is to use a correlation filter to remove highly correlated learning columns.
Use a "Linear Correlation" node together with a "Correlation Filter" node to remove redundant columns;
it is often sufficient to compute the correlation model on a subset of the data.
- Separation: Please see this article
about separation for more information.
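The two preprocessing ideas from the list above (grouping the target into "A" and "not A", and dropping highly correlated columns) can be sketched in plain Python as follows. This is only an illustration of the concepts, not what the Rule Engine, Linear Correlation, or Correlation Filter nodes do internally; the function names, the greedy filtering strategy, the 0.9 threshold, and the sample data are assumptions.

```python
import math

def one_vs_rest(targets, keep="A"):
    """Map every category other than `keep` to 'not <keep>'."""
    return [t if t == keep else f"not {keep}" for t in targets]

def pearson(a, b):
    """Pearson (linear) correlation of two equally long numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def correlation_filter(columns, threshold=0.9):
    """Greedy filter: skip constant columns, then keep a column only if its
    absolute correlation with every already kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column carries no information
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: x2 is almost a multiple of x1, x4 is constant.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "x4": [7.0, 7.0, 7.0, 7.0, 7.0],
}
print(one_vs_rest(["A", "B", "Z", "A"]))
print(correlation_filter(columns))
```

Which of two correlated columns is kept depends on the column order here; a real filter may choose differently.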
Dialog Options
- Target
-
Select the target column. Only columns with nominal data are allowed. The reference category is empty
if the domain of the target column is not available. In this case, the node determines the domain values right
before computing the logistic regression model and chooses the last domain value as the target's reference
category.
By default, the target domain values are sorted lexicographically in the output, but you can preserve the
order of the target column domain by checking the box.
Note that if a target reference category is selected in the dropdown, the checkbox has no influence on the
coefficients of the model; only the output representation (e.g. the order of rows in the coefficient table)
may vary.
- Values
-
Specify the independent columns that should be included in the regression model.
Numeric and nominal data can be included.
By default, the domain values (categories) of nominal columns are sorted lexicographically,
but you can check the box to use the order from the column domain instead. Please note that the first
category is used as the reference when creating the
dummy variables.
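As an illustration of how the category order determines the reference, here is a small dummy-coding sketch in plain Python. The function name `dummy_encode`, the `is_<category>` naming, and the use of first-appearance order as a stand-in for the column domain order are assumptions for this example, not the node's behavior.

```python
def dummy_encode(values, sort_lexicographically=True):
    """Return (reference category, dummy column names, encoded rows).

    The first category in the chosen order is the reference and gets no
    dummy column; a row of all zeros therefore means 'reference category'.
    """
    if sort_lexicographically:
        categories = sorted(set(values))
    else:
        # Assumption: order of first appearance stands in for the domain order.
        categories = list(dict.fromkeys(values))
    reference, rest = categories[0], categories[1:]
    names = [f"is_{c}" for c in rest]
    rows = [[1 if v == c else 0 for c in rest] for v in values]
    return reference, names, rows

colors = ["red", "green", "blue", "green"]
print(dummy_encode(colors))                                # reference: "blue"
print(dummy_encode(colors, sort_lexicographically=False))  # reference: "red"
```

Changing the order changes which category the coefficients are measured against, so the coefficient values differ even though the fitted probabilities describe the same model.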
Ports
Input Ports
0 |
Table on which to perform regression. The input must not contain missing values; fix them beforehand, e.g. with the Missing Values node. |
1 |
Optional PMML port object containing preprocessing operations. |
Output Ports
0 |
Model to connect to a predictor node. |
1 |
Coefficients and statistics of the logistic regression model. |
Views
- Logistic Regression Result View
-
Displays the estimated coefficients and error statistics. Note
that the estimated coefficients are not reliable when the standard
error is high.
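One common way to judge whether a standard error is "high" relative to its coefficient is the Wald z-statistic (coefficient divided by standard error) and its two-sided p-value under a standard normal distribution. The sketch below is a generic illustration in plain Python; the function name `wald_test` and the example numbers are assumptions and are not taken from the view.

```python
import math

def wald_test(coefficient, std_error):
    """Return (z, two-sided p-value) for the hypothesis coefficient == 0."""
    z = coefficient / std_error
    # For a standard normal Z: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Same coefficient, different standard errors (hypothetical values).
print(wald_test(1.0, 0.5))  # small standard error relative to the coefficient
print(wald_test(1.0, 5.0))  # large standard error: coefficient is unreliable
```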
This node is contained in KNIME Base Nodes
provided by KNIME GmbH, Konstanz, Germany.