# Overfitting in induction classification trees

- Boundaries are created to separate the different classes in the data
- These are like lines drawn parallel to the feature axes, since each test splits on a single feature (see the sketch after this list)

- Goal $\to$ place these boundaries so that each class is completely separated.
- Every point in one class is on one side of the boundary, and every point in the other class is on the other side.
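
A minimal sketch of those axis-parallel tests, assuming scikit-learn is available (`load_iris` is just a convenient stand-in for any dataset):

```python
# Minimal sketch (assumes scikit-learn is installed): every internal node of
# a fitted classification tree compares one feature against one threshold,
# so each boundary it draws is parallel to a feature axis.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Each printed test, e.g. "feature_2 <= 2.45", is one axis-parallel boundary
print(export_text(tree, max_depth=2))
```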

- The induction process continues until all subsets associated with leaf nodes have an entropy of 0, i.e. are perfectly homogeneous
- This can lead to overfitting, as illustrated below
- Outliers and noise end up in their own leaves, which hold very small data subsets
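
A minimal sketch of this failure mode, assuming scikit-learn; the `flip_y` parameter injects label noise so the fully grown tree has something to memorize:

```python
# Minimal sketch (assumes scikit-learn): growing the tree until every leaf is
# pure (entropy 0) fits the training data perfectly, noise included, which
# typically hurts performance on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                           random_state=0)  # flip_y=0.2 injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(criterion="entropy", random_state=0)
full.fit(X_tr, y_tr)
print(full.score(X_tr, y_tr))  # 1.0: every leaf is perfectly homogeneous
print(full.score(X_te, y_te))  # noticeably lower: the tree memorised noise
```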

### Pruning: overfitting solution

- see also: pruning in neural networks
- Tests at the deepest levels are replaced with a leaf node
- The subsets associated with the test's leaf nodes are combined
- The class of the new leaf node is the class that occurs most often in the combined subset
- If generalization performance improves, the pruning is effected
- If generalization performance does not improve, the test is re-instated
- Makes decision trees robust to outliers and noise (sketched below)
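
A minimal sketch of this pruning loop (reduced-error pruning). The `Node` representation and helper functions are illustrative assumptions, not a library API; the validation set stands in for the generalization-performance check:

```python
# Minimal sketch of the pruning loop described above (reduced-error pruning).
# The Node class and helpers are illustrative assumptions, not a library API.
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 label=None, samples=()):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.label = label            # set only on leaf nodes
        self.samples = list(samples)  # training labels that reached this leaf

    def is_leaf(self):
        return self.label is not None

def predict(root, x):
    node = root
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def accuracy(root, validation_set):
    hits = sum(predict(root, x) == y for x, y in validation_set)
    return hits / len(validation_set)

def prune(node, root, validation_set):
    """Bottom-up: try replacing each deepest test with a majority-class leaf."""
    if node.is_leaf():
        return
    prune(node.left, root, validation_set)
    prune(node.right, root, validation_set)
    if node.left.is_leaf() and node.right.is_leaf():
        before = accuracy(root, validation_set)
        saved = (node.feature, node.threshold, node.left,
                 node.right, node.samples)
        # Combine the two leaf subsets; label with the most frequent class
        node.samples = node.left.samples + node.right.samples
        node.label = Counter(node.samples).most_common(1)[0][0]
        node.left = node.right = None
        # Keep the pruning only if generalization performance improves
        if accuracy(root, validation_set) <= before:
            node.label = None  # no improvement: re-instate the test
            (node.feature, node.threshold, node.left,
             node.right, node.samples) = saved
```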

## Overfitting from information gain

- Information gain favours input variables with many outcomes
- Many outcomes result in many branches
- $\therefore$ many decision rules
- $\therefore$ a complex classifier

- For input variables with unique values (e.g. an identifier): one branch per value, so the gain is maximal even though the split does not generalize

**Solution to this bias:** normalize the information gain by the entropy with respect to the test outcomes:

- The gain ratio is computed as $\text{gainRatio}(x)=\frac{\text{gain}(x)}{\text{splitInfo}(x)}$, where $\text{splitInfo}(x)=-\sum_{o=1}^{O} p_o \log_2 p_o$ and $p_o$ is the fraction of instances assigned to outcome $o$ of the test
- Objective: select the $x$ that maximizes the gain ratio (sketched below)
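
A minimal sketch of the gain-ratio computation in plain Python. The four-instance dataset is made up for illustration; it also shows how an identifier-like attribute is penalized:

```python
# Minimal sketch of gain ratio; the tiny dataset below is illustrative only.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    n = len(labels)
    outcomes = {}
    for v, y in zip(values, labels):
        outcomes.setdefault(v, []).append(y)
    # gain(x): parent entropy minus the weighted entropy of the subsets
    gain = entropy(labels) - sum(len(sub) / n * entropy(sub)
                                 for sub in outcomes.values())
    # splitInfo(x) = -sum_o p_o log2 p_o, p_o = fraction in outcome o
    split_info = -sum((len(sub) / n) * math.log2(len(sub) / n)
                      for sub in outcomes.values())
    return gain / split_info if split_info else 0.0

# An ID-like attribute (one unique value per instance) has maximal gain but
# also maximal splitInfo, so its gain ratio is heavily penalized.
labels = ["yes", "yes", "no", "no"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # informative split: 1.0
print(gain_ratio([1, 2, 3, 4], labels))          # ID attribute: 0.5
```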