Overfitting in induction classification trees
- boundaries created to separate different classes in data
- like lines drawn parallel to the axes of the features
- goal: make these boundaries so that each class is completely separated.
- every point in one class is on one side of the boundary, and every point in the other class is on the other side.
- The induction process continues until all subsets associated with leaf nodes have an entropy of 0, i.e. are perfectly homogeneous (see the sketch after this list)
- can lead to overfitting
- Outliers and noise end up in leaves of their own, with very small data subsets
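As a concrete illustration of the stopping criterion, here is a minimal Python sketch (the `entropy` helper is my naming, not from the notes):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# Splitting stops once a leaf's subset is perfectly homogeneous:
print(entropy(["a", "a", "a"]))       # 0.0 -> pure, becomes a leaf
print(entropy(["a", "b", "a", "b"]))  # 1.0 -> maximally mixed, keep splitting
```

Driving every leaf all the way to entropy 0 is exactly what lets a handful of outliers claim tiny leaves of their own.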
Pruning: overfitting solution
- see also: pruning in neural networks
- Tests at the deepest levels are replaced with a leaf node
- Subsets associated with the test's leaf nodes are combined
- Class of leaf node is class that occurs most in the new subset
- If generalization performance improves, then the pruning is accepted
- If generalization performance does not improve, then test is re-instated
- Makes decision trees robust to outliers and noise (a minimal pruning sketch follows)
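The loop below is a sketch under stated assumptions, not the notes' exact procedure: the `Node` class and the `score` callback (e.g. accuracy on a held-out validation set) are illustrative names I introduce, and ties in performance are resolved here in favour of the simpler, pruned tree.

```python
from collections import Counter

class Node:
    """Minimal decision-tree node; a node is a leaf when `test` is None."""
    def __init__(self, test=None, children=None, label=None, subset=()):
        self.test = test                # feature tested at this node
        self.children = children or {}  # test outcome -> child Node
        self.label = label              # predicted class (leaves only)
        self.subset = list(subset)      # training labels that reached this node

def prune(node, score):
    """Bottom-up pruning. `score()` measures generalization performance
    of the whole tree, e.g. accuracy on a validation set."""
    if node.test is None:
        return  # already a leaf
    for child in node.children.values():
        prune(child, score)  # handle the deepest levels first
    before = score()
    saved = (node.test, node.children, node.label, node.subset)
    # Combine the subsets associated with the test's leaf nodes ...
    combined = [y for c in node.children.values() for y in c.subset]
    # ... and replace the test with a leaf whose class is the one that
    # occurs most in the new subset.
    node.test, node.children = None, {}
    node.subset = combined
    node.label = Counter(combined).most_common(1)[0][0]
    if score() < before:
        # Performance got worse: re-instate the test.
        node.test, node.children, node.label, node.subset = saved
```

In practice `score` would be a closure over the tree's root and a validation set, so that each tentative replacement is judged on data the tree was not grown on.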
Overfitting from information gain
- Favours input variables with many outcomes
- results in many branches
- many decision rules
- a complex classifier
- For input variables with unique values: one branch per value
Solution to this bias:
- Normalize the information gain by the entropy with respect to the test outcomes (the split information)
- Gain ratio is computed as $\text{GainRatio}(S, A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(S, A)}$, where $\text{SplitInfo}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$ and $S_1, \dots, S_k$ are the subsets produced by the $k$ outcomes of test $A$
- Objective: select the test $A$ that maximizes the gain ratio (see the sketch below)
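A runnable sketch of this correction, assuming the standard C4.5-style definitions (helper names are mine): an ID-like feature with one branch per value achieves maximal information gain, but its large split information drives the gain ratio down.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from splitting `labels` by the outcomes in `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def split_info(values):
    """Entropy with respect to the test outcomes: grows with branch count."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    # Assumes the test has more than one outcome (split_info > 0).
    return information_gain(values, labels) / split_info(values)

labels = ["yes", "yes", "no", "no"]
ids    = [1, 2, 3, 4]          # unique value per example: one branch per value
useful = ["a", "a", "b", "b"]  # genuinely predictive feature
print(information_gain(ids, labels), gain_ratio(ids, labels))        # 1.0 0.5
print(information_gain(useful, labels), gain_ratio(useful, labels))  # 1.0 1.0
```

Both features have the same information gain, but only the genuinely predictive one keeps a high gain ratio, which is the bias correction the notes describe.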